On Wed, Jan 5, 2011 at 11:24 AM, Laurent Guyon <laurent.gu...@adelux.fr> wrote:
> Hi,
>
Hi,
>
> We are currently deploying Shinken (with Thruk and livestatus broker
> module) for one of our customers, to try replacing and improving its
> legacy monolithic Nagios solution. The Shinken installation is nearly
> finished and tests are in progress.
>
Good :)
>
> We noticed that availability/trend reports generated with Thruk via the
> Livestatus broker module contain strange values that do not reflect
> reality.
>
Ouch.
>
> Then I started to play a bit with your re-implementation of the
> Livestatus module, to look at the scheduler and broker code, and I have
> a few questions about it.
>
>
> So, the Livestatus broker module gets all "monitoring log
> messages" (host/service status changes, notification logs...) from the
> scheduler(s) when it connects to them, and stores them in its own sqlite
> database to keep all the historical data necessary for generating
> availability/trend reports and more.
>
In fact, only the logs are kept in the sqlite file. All other data, like
states or check results, are just used to update the host/service states in
the Livestatus module; such data are purely for immediate use. Livestatus is
a "what is the current state?" module. Only the logs are kept, if I'm not
wrong.
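To give you a rough idea, here is a little Python sketch of that split
(made-up names, not the real Livestatus module code): log broks get
persisted to sqlite, everything else only touches an in-memory view:

import sqlite3

class ToyLivestatus:
    # Sketch only: logs go to sqlite, states stay in memory.
    def __init__(self, db_path="logs.db"):
        self.hosts = {}  # current-state view, rebuilt at startup
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS logs (time INTEGER, message TEXT)")

    def manage_brok(self, brok):
        if brok["type"] == "log":
            # Persisted: this is what the reports are built from later.
            self.db.execute("INSERT INTO logs VALUES (?, ?)",
                            (brok["data"]["time"], brok["data"]["message"]))
            self.db.commit()
        elif "host_name" in brok["data"]:
            # Immediate use only: overwrite the in-memory state.
            self.hosts[brok["data"]["host_name"]] = brok["data"]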
>
> If I understand the code correctly, all pending "monitoring log
> messages" that must be sent to brokers are stored by the scheduler in
> brok instances and queued in a simple python dict. In the same way,
> all broks received by the broker are also queued in a simple python
> dict. And:
>
Yes. There are a lot of queues everywhere in the code, in fact. It's a
simple way of managing so much data :)
>
> * The scheduler queue is emptied when a broker (the first) reads the
> pending broks, so using several brokers on 1 scheduler seems impossible
> for the moment (not such a big problem for me for the moment)
>
Yes. There should be one broker per realm (and a spare, of course). For now
I think we are fine with only one. Maybe if we really need it in the future
we will add multiple queues for multiple brokers, but that would complicate
the architecture, so if we do not need it, we won't do it.
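To make both points concrete, here is a stripped-down sketch of the queue
(made-up names, the real scheduler code is more involved): broks sit in a
plain dict, and the first broker that asks gets the whole thing while the
dict is reset:

import itertools

class ToyScheduler:
    def __init__(self):
        self.broks = {}               # id -> brok: the whole "queue"
        self._ids = itertools.count()

    def add_brok(self, brok):
        self.broks[next(self._ids)] = brok

    def get_broks(self):
        # Hand over everything and reset: a second broker asking
        # right after this gets an empty dict, hence one broker
        # (plus its spare) per scheduler for now.
        res = self.broks
        self.broks = {}
        return res

Multiple brokers would mean keeping one such dict per registered broker
and appending every brok to all of them: that is the extra complexity I
mentioned.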
>
> * The scheduler and broker queues are only stored in memory, so if the
> scheduler or the broker stops or crashes, will all the pending broks be
> lost? (more problematic)
>
Yes. There is too much data to store in a database, in fact. When the
scheduler comes back, the broker asks it for a full state, so it gets fresh
states. The scheduler has its retention data, so it does not start from a
PENDING environment.
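The reconnection flow is roughly this (a sketch with made-up function
names; only the full-snapshot idea is the point):

def on_scheduler_back(scheduler, module):
    # The broker asks for a full state dump first, so the module's
    # in-memory view is rebuilt from scratch. The broks queued during
    # the outage are gone for good, though: only the *current* state
    # is recovered, not the history in between.
    for brok in scheduler.get_full_state_broks():  # hypothetical name
        module.manage_brok(brok)
    # ...then back to polling scheduler.get_broks() as usual.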
>
> * When the scheduler queue is full, the oldest broks start to be dropped.
>
Yes, for now there is no "backlog for broks". See the idea at
http://shinken.ideascale.com/a/dtd/Backlogs-for-broks-in-scheduler/76429-10373
to vote for it ;)
> So if the link scheduler <-> broker is lost for a long time, it seems
> we will lose log messages and potentially very important monitoring data
> (like host/service state changes). If so, nearly all the work done by the
> scheduler while this link is lost will go to /dev/null (checks are done,
> but results are lost; in fact, only actions made by reactionners will
> eventually be done).
>
Yes. With backlogs, it won't be a problem anymore. But it will cost a lot of
I/O if the connection is lost for a long time, of course.
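The backlog could look roughly like this (just my sketch of the idea,
nothing like it exists in the code yet): once the in-memory queue is full,
the oldest broks get pickled to disk instead of dropped, and are replayed
on the next read. The disk writes and the replay are exactly the I/O cost
I mean:

import os
import pickle

class SpillingBrokQueue:
    def __init__(self, max_in_memory=10000, path="broks.backlog"):
        self.max_in_memory = max_in_memory
        self.path = path
        self.broks = []

    def add(self, brok):
        self.broks.append(brok)
        if len(self.broks) > self.max_in_memory:
            # Spill the oldest half to disk rather than losing it.
            half = self.max_in_memory // 2
            with open(self.path, "ab") as f:
                for old in self.broks[:half]:
                    pickle.dump(old, f)
            self.broks = self.broks[half:]

    def drain(self):
        # Replay the spilled broks first (oldest first), then the
        # in-memory ones, and start from an empty queue again.
        res = []
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                while True:
                    try:
                        res.append(pickle.load(f))
                    except EOFError:
                        break
            os.remove(self.path)
        res.extend(self.broks)
        self.broks = []
        return res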
>
>
> My questions/remarks :
>
> * Will the Shinken Livestatus module be periodically upgraded to
> follow the original Livestatus progression (new features...)?
>
Yes, we try to keep it at the same level. We are not at 100%; we are missing
some features, like WAIT for example, but we are reducing the gap. We asked
the Livestatus author if there were automated tests so we could be sure to
be clean, but there are none, so it will be more difficult :)
This module is the most important "presentation" module for us, so you can
be sure it will be our main focus :)
> * What about the Livestatus sqlite database in one year, in a big
> monitoring architecture? A big database, slow queries?
>
I'll let Gerhard answer that, because I really don't know.
>
> * What about using Livestatus on top of Simple log, or perhaps storing
> the logs in a couchdb/mongodb database instead of sqlite or any other
> relational database?
>
Why not, it can be an option. But the default will stay sqlite, because it's
available everywhere.
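If someone wants to try it, the swap should be easy to isolate: as far as I
can tell, the reporting side only ever appends logs and scans them by time
range, so the backend could hide behind a tiny interface like this (my
sketch, not an existing Shinken API):

class LogStore:
    # Minimal contract a pluggable log backend would need to satisfy
    # (assumption for illustration only).
    def add_log(self, time, message):
        raise NotImplementedError

    def logs_between(self, start, end):
        # Availability/trend reports scan logs by time range.
        raise NotImplementedError

# The sqlite version stays the default because the stdlib ships it;
# a mongodb one would just implement the same two methods with an
# insert and a range query on an indexed 'time' field.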
>
> * I think it's important (in a production context) to secure the brok
> exchange process between scheduler and broker so it becomes stop-safe,
> crash-safe and link-loss-safe, because I fear the reason we get some
> curious results in our availability/trend reports is that we have lost
> some important broks (like host/service state change messages) while
> restarting daemons. Is such a feature planned on the roadmap?
>
Yep :)
But I'm wondering if you are using the right module for reporting. Like I
said, Livestatus is not made for that; it's just an immediate view. There is
no real reporting module for now. You can try with NDO, but it's just slow
for huge configurations :(
Such a module will need its indicators to be decided, and that's why we are
really waiting for the Nareto project (Nagios reporting) to start up with
such values, so we can write a data-warehouse-like module that will work
alongside Livestatus (relational + real-time monitoring is BAD! :) ).
>
>
> Thank you for all the work you've done ;)
>
You're welcome :)
Let us know if you need more information,
Jean
>
>
> Regards,
>
> Laurent Guyon
> Adelux