[Shinken-devel] Business rule, criticity and problem/impacts : the enhanced monitoring way!

nap Wed, 22 Dec 2010 06:58:28 -0800

Hi all,


I just pushed some new features to the code:
* business rules in the core!
* a business rule is seen as an "impact" of one of its components if that
component goes into a problem state (or one of its parents does, of course)
* criticity is a new value for hosts/services
* a min_criticity property is added to contacts/notificationways: now you can
be woken up at night by an SMS for PROD impacts only
* the criticity of a problem is the max value of the criticity of its impacts!

OK, OK. That makes a lot of things, but in fact they are all linked :)

Let's look at them one by one:

*Business rules*
Yes, now we have business rules in the core, and they are very easy to set
up! It's like the business process addon, but this time they are real
service/host objects, so you can display them in Thruk or NagVis or whatever
UI you want :)

Let's take an example. You have a huge ERP. You want a single "object" that
shows you its state. All you need to do is:

# yes, it's a simple service like all the others! It can be a host too if
# you want
define service {
    host_name              dummy_host
    service_description    ERP
    use                    generic-service
    check_command          bp_rule!(host1,db1 | host2,db2) & (2 of: host3,web1 & host4,web2 & host5,web3) & (host5,lvs1 | host6,lvs2)
}

Here the rule is:
* you must have at least one database OK
* you must have at least 2 of the web services OK
* you must have at least one load balancer OK
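
To make the rule concrete, here is a minimal Python sketch of the same logic.
It is a hand-written equivalent of the ERP expression, not the Shinken parser;
the `states` table and the helper names are made up for the illustration.

```python
# Hypothetical state table: True means the service is OK.
# Keys are (host_name, service_description) pairs taken from the rule above.
states = {
    ("host1", "db1"): False,
    ("host2", "db2"): True,
    ("host3", "web1"): True,
    ("host4", "web2"): False,
    ("host5", "web3"): True,
    ("host5", "lvs1"): True,
    ("host6", "lvs2"): True,
}


def at_least(n, members, states):
    """Return True when at least n of the given members are OK."""
    return sum(1 for member in members if states[member]) >= n


def erp_is_ok(states):
    """Hand-written equivalent of the ERP rule (illustration only)."""
    databases = [("host1", "db1"), ("host2", "db2")]
    web = [("host3", "web1"), ("host4", "web2"), ("host5", "web3")]
    balancers = [("host5", "lvs1"), ("host6", "lvs2")]
    return (at_least(1, databases, states)      # host1,db1 | host2,db2
            and at_least(2, web, states)        # 2 of: web1 & web2 & web3
            and at_least(1, balancers, states)) # host5,lvs1 | host6,lvs2


print(erp_is_ok(states))
```

With one database and one web server down, as in the table, the ERP is still
OK; losing a second web server would make the rule fail.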

And the service host1,db1 can itself be a business-rule-based service, of
course, because these are real services. You can add them to servicegroups,
add dependencies or get notifications from them. The only difference is that
the "check" is done by the scheduler, not by a poller. That's all.

The command "bp_rule" is now reserved. You do not have to define it, and you
can't define your own command with the same name either. We will have more
commands like it in the future (like one for "loading" business process files,
for example).

They are simple services/hosts, with SOFT/HARD states, check intervals and
all the rest. So in fact you already know all you need about them :)

*Business rules are impacts*
Business rules are good, but since they are services/hosts, they should be
part of the problem/impacts logic, so that future Thruk/NagVis views can show
them that way (you see only "root problems", and if you click on one, you see
all its impacts).

If you have the ERP defined as above, what if db1 and db2 fail? You need to
know that db1 and db2 "impact" ERP. You could set a service dependency, but
it's not a good idea: you already wrote that "rule" in the check_command, so
you should not have to do the work twice! And with a dependency, you won't
get notifications for ERP if db1 and db2 fail. But maybe your boss would be
happy to get such a notification...

So we need the problem/impacts logic without the notification suppression.
And that's exactly what you get automatically with bp_rule! :)

If db1 and db2 fail, ERP will be one of their "impacts", and you will see in
the console that they are the "problems" behind ERP, so you know what to fix.
But notifications are still enabled for ERP, so your boss will get his
email (and so you really need to find the root problem quickly!).
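
As a rough illustration of how the impacts can be derived from the rule
itself, with no extra dependency to declare, here is a small Python sketch.
The `rules` dict and `build_impacts` are hypothetical names, and the regular
expression is a simplification that only handles plain `host,service` pairs
made of word characters; it is not how the core parses rules.

```python
import re
from collections import defaultdict

# Hypothetical in-memory copy of the ERP rule from this mail.
rules = {
    ("dummy_host", "ERP"): "(host1,db1 | host2,db2) & (2 of: host3,web1 & "
                           "host4,web2 & host5,web3) & (host5,lvs1 | host6,lvs2)",
}


def build_impacts(rules):
    """Map each component (host, service) to the business rules it impacts."""
    impacts = defaultdict(list)
    for owner, expression in rules.items():
        # Every "host,service" pair in the expression is a component.
        for host, service in re.findall(r"(\w+),(\w+)", expression):
            impacts[(host, service)].append(owner)
    return impacts


impacts = build_impacts(rules)
print(impacts[("host1", "db1")])
```

When db1 goes into a problem state, looking it up in this map gives the ERP
service as one of its impacts.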


*Criticity*
It took us some time to find the name, but now we even have the code :)
Hosts and services can be "tagged" with a criticity. It's an integer value
that goes from 0 (I don't care about this) to 5 (if it falls, I lose my
job!). The default value is 3. If a service does not have such a value, it
takes the value from its host (implicit inheritance).

This value is already available in LiveStatus, so you should soon see a new
Thruk view that shows the really critical elements first (like the ERP, while
a dev server will be on the 2nd page).
You will find "what to fix now" in a second! :)
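
The inheritance can be sketched in a few lines of Python. Hosts and services
are plain dicts here, and `effective_criticity` is a name invented for the
example; the real objects in the core look different.

```python
DEFAULT_CRITICITY = 3  # default value stated in this mail


def effective_criticity(service, host):
    """Return the service value if set, else the host value, else the default."""
    for obj in (service, host):
        value = obj.get("criticity")
        if value is not None:
            return value
    return DEFAULT_CRITICITY


erp_host = {"host_name": "dummy_host", "criticity": 5}
print(effective_criticity({"service_description": "ERP"}, erp_host))
```

Here the ERP service has no value of its own, so it takes the 5 of its host.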


*min_criticity filter for the contacts*
Criticity is a good thing for UIs, but it's also a good thing for
notifications. Maybe you want SMS at night only for really important things.
If it's a CRITICAL on a dev server, the admin won't be happy to be woken up.

min_criticity is a new parameter on contacts/notificationways for exactly
this. By default it's 0 (so everything passes this filter). But if the admin
sets it to 5, for example, he will only be notified about really important
problems, and all with the same contact definition of course! (fewer
contacts, less configuration, fewer problems!)
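
A minimal Python sketch of the filter, assuming an element passes when its
criticity is greater than or equal to min_criticity. The notification way
names are hypothetical, and timeperiods and the other notification filters
are left out on purpose.

```python
# Hypothetical notification ways of a single contact.
notification_ways = [
    {"name": "email_24x7", "min_criticity": 0},
    {"name": "sms_night", "min_criticity": 5},
]


def accepted_ways(criticity, ways):
    """Return the names of the ways whose min_criticity lets the element through."""
    return [way["name"] for way in ways
            if criticity >= way.get("min_criticity", 0)]


print(accepted_ways(3, notification_ways))
print(accepted_ways(5, notification_ways))
```

A problem with criticity 3 only goes out by email, while a criticity 5
problem also passes the SMS filter.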


*Criticity is computed from the max criticity of the problem's impacts*
OK, criticity is a good thing for application admins, like for the ERP. But
what about the poor network admins? They have a lot of switches/routers with
dev/qualification/production behind them. What do they set as criticity? They
would have to set the max value to avoid taking risks :(

That's why the criticity of a "root problem" is in fact set to the max value
of its impacts' criticity! So the network admin does not need to touch the
criticity of his switches. If a switch impacts the ERP, it will have criticity
5 and so he will be woken up by an SMS; if there are only dev servers behind
it, it will be less, so he can keep sleeping and be happy in the morning :)

The main idea is: it's not the servers, switches and routers that are
important. Applications are, the infrastructure is not. The criticity of the
infrastructure is the criticity of the applications running on it. So you only
need to tag the "end user" applications, and your infrastructure will
dynamically get the criticity it needs :)
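
In Python terms, the computation could look like the sketch below. It assumes
the element's own value is kept as a floor under the max of its impacts, which
this mail does not spell out; `problem_criticity` is a made-up name.

```python
def problem_criticity(own_criticity, impact_criticities):
    """Criticity of a root problem: max of its own value and its impacts' values.

    Keeping own_criticity in the max is an assumption of this sketch.
    """
    return max([own_criticity, *impact_criticities])


# A switch left at the default of 3:
print(problem_criticity(3, [5, 1]))  # the ERP (5) and a dev server (1) behind it
print(problem_criticity(3, [1, 1]))  # only dev servers behind it
```

The same untouched switch ends up with 5 when the ERP is among its impacts,
and stays at 3 when only dev servers are.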


*An example?*
Let's take a simple example: a network admin only wants to be woken up at 3AM
by an SMS if the impact of the switch problem is a huge production service
(not a little prod one, THE production).

What the admin needs:
* the ERP service, with a criticity of 5
* a notificationway in the network admin's contact with the night as its
notification period (he also has another notificationway that is 24x7 by
email), "send-by-SMS" as its command, and the new property min_criticity 5.

If the "parents" are set correctly for the hosts/switches, it's done. The
admin will only be woken up by an SMS if it's the ERP that is down.

I hope we will soon have views in NagVis/Thruk for problems/impacts, and we
will have a brand new way of looking at monitoring, focused on criticity and
on real-world applications with many layers. We need to help the admins find
the root problems, and be notified only when they really need it. Now they
will be happy, and have good nights :)

There are still things to come for the 0.5 of course, so stay tuned, and grab
the code from git to test all these new features before contact downtimes and
time-based escalations arrive! :)

Now I'll take a little time to write the doc about all of these features,
and do some Perl for the Thruk views ;)


Jean
