Re: How much downtime do we afford for nagios?

2008-04-27 Thread Mike McGrath
On Mon, 28 Apr 2008, Nigel Jones wrote:

> On Sun, April 27, 2008 11:01 pm, Jeroen van Meeuwen wrote:
> > Nigel Jones wrote:
> >> Looking through my email, from what I can recall there are no false
> >> positives.  xen6 had to be power-cycled which caused all the other
> >> collateral notifications.
> >>
> >
> > Collateral notifications can be caught using service dependencies and
> > parent hosts. Do we currently use any?
> I believe we do, but it wouldn't have helped in this case (I've done a bit
> more digging)
>
> Half the notifications came from the external nagios instance on noc2,
> while the xen6/db alerts came from the internal nagios instance. Another
> reason why I like the current setup and don't think we should change a
> thing :)
>
> Also, the UNKNOWN alerts weren't that bad; they were a precursor to the
> box having to be restarted. Only in this case were the up/down alerts a
> little useless.  However, I'd sooner keep them as is, because otherwise we
> run the risk of not noticing a box down immediately and getting everyone
> under the moon asking "why can't I access fedoraproject.org... it's down,
> your OS can't be that good".

One thing I would like implemented is event handlers.  Some things
(probably not this thing) could be handled automatically for us.
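
For illustration only, here is a minimal sketch of the decision an event
handler script might make. Nagios can pass macros such as $SERVICESTATE$,
$SERVICESTATETYPE$ and $SERVICEATTEMPT$ to a handler command; the policy
below (act on the third soft failure or on any hard failure) and the
function name are assumptions, not anything from the Fedora configs.

```python
def decide_action(state: str, state_type: str, attempt: int) -> str:
    """Return "restart" or "ignore" for one event handler invocation.

    The three arguments mirror the values of $SERVICESTATE$,
    $SERVICESTATETYPE$ and $SERVICEATTEMPT$. The policy is hypothetical.
    """
    if state != "CRITICAL":
        # OK, WARNING and UNKNOWN are left for a human to look at.
        return "ignore"
    if state_type == "HARD":
        # Retries are exhausted, so try the automatic fix.
        return "restart"
    if state_type == "SOFT" and attempt >= 3:
        # Act just before the problem would turn into a hard state.
        return "restart"
    return "ignore"


if __name__ == "__main__":
    for args in [("CRITICAL", "SOFT", 1), ("CRITICAL", "SOFT", 3), ("OK", "HARD", 1)]:
        print(args, "->", decide_action(*args))
```

A real handler would then run whatever restart command fits the service;
that part is left out here on purpose.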

-Mike

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: How much downtime do we afford for nagios?

2008-04-27 Thread Mike McGrath
On Sun, 27 Apr 2008, Jeroen van Meeuwen wrote:

> Nigel Jones wrote:
> > Looking through my email, from what I can recall there are no false
> > positives.  xen6 had to be power-cycled which caused all the other
> > collateral notifications.
> >
>
> Collateral notifications can be caught using service dependencies and parent
> hosts. Do we currently use any?
>

There's a ticket open but no progress.

-Mike



Re: How much downtime do we afford for nagios?

2008-04-27 Thread Mike McGrath

On Sun, 27 Apr 2008, susmit shannigrahi wrote:

> >  > So if a service or host is unreachable for 3 or 4 mins, we get a
> >  > notification. (However, in most cases it is a false positive, due
> >  > to congestion or other causes.)
> >  Looking through my email, from what I can recall there are no false
> >  positives.  xen6 had to be power-cycled which caused all the other
> >  collateral notifications.
>
>
> How long was it down?  Why should a normal reboot send 23 mails?
> A reboot is not an exceptional thing, is it?
> An alert should be sent only when it's absolutely necessary...
> It should report only when xen6 comes up but a service does not come up.
> What do you think?
> Thanks.
>


A normal reboot shouldn't, but when it's in a hung state, it takes a while
before people can get to it.

-Mike



Re: How much downtime do we afford for nagios?

2008-04-27 Thread Nigel Jones
On Sun, April 27, 2008 11:01 pm, Jeroen van Meeuwen wrote:
> Nigel Jones wrote:
>> Looking through my email, from what I can recall there are no false
>> positives.  xen6 had to be power-cycled which caused all the other
>> collateral notifications.
>>
>
> Collateral notifications can be caught using service dependencies and
> parent hosts. Do we currently use any?
I believe we do, but it wouldn't have helped in this case (I've done a bit
more digging)

Half the notifications came from the external nagios instance on noc2,
while the xen6/db alerts came from the internal nagios instance. Another
reason why I like the current setup and don't think we should change a
thing :)

Also, the UNKNOWN alerts weren't that bad; they were a precursor to the
box having to be restarted. Only in this case were the up/down alerts a
little useless.  However, I'd sooner keep them as is, because otherwise we
run the risk of not noticing a box down immediately and getting everyone
under the moon asking "why can't I access fedoraproject.org... it's down,
your OS can't be that good".

- Nigel
>
> Kind regards,
>
> Jeroen van Meeuwen
> -kanarip




Re: How much downtime do we afford for nagios?

2008-04-27 Thread Jeroen van Meeuwen

Nigel Jones wrote:

Looking through my email, from what I can recall there are no false
positives.  xen6 had to be power-cycled which caused all the other
collateral notifications.



Collateral notifications can be caught using service dependencies and 
parent hosts. Do we currently use any?
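
To make the parent-host idea concrete, here is a small sketch of the
reachability rule as I understand it: a host that fails its check is DOWN
if it has no parents or at least one parent is UP, and UNREACHABLE if all
of its parents are themselves DOWN or UNREACHABLE. Treating db2 as a child
of xen6 is an assumption for the example, as are the function and variable
names.

```python
from typing import Dict, List, Set


def classify_hosts(failed: Set[str], parents: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each known host to "UP", "DOWN" or "UNREACHABLE".

    failed  -- hosts whose own check failed
    parents -- child host -> list of parent hosts (assumed to have no cycles)
    """
    states: Dict[str, str] = {}

    def state(host: str) -> str:
        if host in states:
            return states[host]
        if host not in failed:
            states[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # Blocked only if there are parents and none of them is UP.
            if host_parents and all(state(p) != "UP" for p in host_parents):
                states[host] = "UNREACHABLE"
            else:
                states[host] = "DOWN"
        return states[host]

    for host in set(failed) | set(parents):
        state(host)
    return states


if __name__ == "__main__":
    # Hypothetical topology: db2 runs as a guest on xen6.
    print(classify_hosts({"xen6", "db2"}, {"db2": ["xen6"]}))
```

If contacts are not notified for the UNREACHABLE state, only the xen6
outage would generate a host notification in this example.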


Kind regards,

Jeroen van Meeuwen
-kanarip



Re: How much downtime do we afford for nagios?

2008-04-27 Thread Nigel Jones
>>  > So if a service or host is unreachable for 3 or 4 mins, we get a
>>  > notification. (However, in most cases it is a false positive, due
>>  > to congestion or other causes.)
>>  Looking through my email, from what I can recall there are no false
>>  positives.  xen6 had to be power-cycled which caused all the other
>>  collateral notifications.
>
>
> How long was it down?  Why should a normal reboot send 23 mails?
> A reboot is not an exceptional thing, is it?
> An alert should be sent only when it's absolutely necessary...
> It should report only when xen6 comes up but a service does not come up.
> What do you think?
> Thanks.
Remembering that unresponsive and down are different things, it looks like
it went unresponsive at ~0210 UTC (2-3 minutes before the first email). I
*think* this might have just been the domUs at that point; from IRC logs it
looks like the dom0 was rebooted sometime around 0228 (potentially
beforehand, I do not know).

It's 1 email per checked item for down/up, and I guess, in perspective, the
outage was quite big...

IMO these reports are 'absolutely necessary' and I personally like to
check them every now and then, especially after an outage like this, to see
if everything is back up (the service/host overview on the nagios web
interface is handy for this).

- Nigel
>
>
>
> --
> Regards,
> Susmit.
>
> =
> ssh
> 0x86DD170A
> http://www.fedoraproject.org/wiki/SusmitShannigrahi
> =




Re: How much downtime do we afford for nagios?

2008-04-26 Thread susmit shannigrahi
>  > So if a service or host is unreachable for 3 or 4 mins, we get a
>  > notification. (However, in most cases it is a false positive, due
>  > to congestion or other causes.)
>  Looking through my email, from what I can recall there are no false
>  positives.  xen6 had to be power-cycled which caused all the other
>  collateral notifications.


How long was it down?  Why should a normal reboot send 23 mails?
A reboot is not an exceptional thing, is it?
An alert should be sent only when it's absolutely necessary...
It should report only when xen6 comes up but a service does not come up.
What do you think?
Thanks.



-- 
Regards,
Susmit.

=
ssh
0x86DD170A
http://www.fedoraproject.org/wiki/SusmitShannigrahi
=



Re: How much downtime do we afford for nagios?

2008-04-26 Thread Nigel Jones
> Hi,
Hi,
>
> For a few days the number of false notifications from nagios was lower,
> but it has increased again.
You sure?
>
> Looking at /configs/system/nagios/services/template.cfg reveals
> that it is configured as
> max_check_attempts = 4 and retry_check_interval = 1 for hosts
> and
> max_check_attempts = 3 and retry_check_interval = 1 for services.
>
> So if a service or host is unreachable for 3 or 4 mins, we get a
> notification. (However, in most cases it is a false positive, due to
> congestion or other causes.)
Looking through my email, from what I can recall there are no false
positives.  xen6 had to be power-cycled which caused all the other
collateral notifications.

Just to put it into perspective...
1st notification: 0212UTC - Accounts down on .120-phx
...
5th notification: 0216UTC - UNKNOWN status on xen6 (NRPE: Unable to read
output)
...
11/12th notifications: 0228UTC - Host Down - xen6/db2
& Starting 0233UTC - Host/service UP/Okay notifications

According to my IRC logs xen6 went a bit haywire and had to be rebooted,
so TBH I don't see what is false here.


Yes, congestion can cause some problems, but isn't that also a sign that
stuff may need to be balanced better or given more processing/networking
capacity?

The current delay is long enough to not detect every single VPN blip, but
it's also long enough to give an idea of real problems.
>
> How about working out a delay we can afford if a service or host is
> really down? How about 10 mins (5 attempts x 2 mins)?
IMO this is too long. Also, it doesn't take that long for someone to SSH
in and have a quick look. I don't speak for everyone, but I don't mind
spending 2-5 minutes to check.
>
> Also, we could list which services/hosts are critical and which are not.
> That would help define different notification periods for the
> different hosts/services.
>
> I thought I would do it after the freeze, but it's becoming too annoying.
Personally, I don't think anything should be done at the moment.

- Nigel



How much downtime do we afford for nagios?

2008-04-26 Thread susmit shannigrahi
Hi,

For a few days the number of false notifications from nagios was lower,
but it has increased again.

Looking at /configs/system/nagios/services/template.cfg reveals
that it is configured as
max_check_attempts = 4 and retry_check_interval = 1 for hosts
and
max_check_attempts = 3 and retry_check_interval = 1 for services.

So if a service or host is unreachable for 3 or 4 mins, we get a
notification. (However, in most cases it is a false positive, due to
congestion or other causes.)

How about working out a delay we can afford if a service or host is
really down? How about 10 mins (5 attempts x 2 mins)?
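
As a rough illustration of the arithmetic above, the sketch below computes
two figures for each setting: the simple attempts x interval estimate used
in this mail, and the span from the first failed check to the last retry,
assuming the first failed check counts as attempt 1. It also assumes one
interval unit equals one minute. The function names and labels are made up
for the example.

```python
def retry_span_minutes(max_check_attempts: int, retry_interval_min: float) -> float:
    """Minutes between the first failed check and the final retry.

    Assumes the first failed check is attempt 1, so only
    (max_check_attempts - 1) retries follow it.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval_min


def rough_delay_minutes(max_check_attempts: int, retry_interval_min: float) -> float:
    """The simpler attempts x interval estimate."""
    return max_check_attempts * retry_interval_min


# (attempts, retry interval in minutes); labels are illustrative only.
SETTINGS = {
    "4 attempts x 1 min": (4, 1),
    "3 attempts x 1 min": (3, 1),
    "5 attempts x 2 min": (5, 2),
}

if __name__ == "__main__":
    for label, (attempts, interval) in SETTINGS.items():
        span = retry_span_minutes(attempts, interval)
        rough = rough_delay_minutes(attempts, interval)
        print(f"{label}: retry span {span} min, rough estimate {rough} min")
```

Neither figure includes the time between the real failure and the first
check that notices it, which depends on the normal check interval.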

Also, we could list which services/hosts are critical and which are not.
That would help define different notification periods for the
different hosts/services.

I thought I would do it after the freeze, but it's becoming too annoying.

Thanks


-- 
Regards,
Susmit.

=
ssh
0x86DD170A
http://www.fedoraproject.org/wiki/SusmitShannigrahi
=
