[Nagios-users] Nagios Sending Notifications to Contacts not in Escalation Config (Bug?)

2011-11-14 Thread Rai Ricafrente
Hi guys!

I am now officially baffled by how Nagios handles service escalations and
notifications. I'm using Nagios 3.2.3 on SLES 10 SP3, and my current setup
is this:

service_escalation.cfg:

define serviceescalation {
   service_description http_80
   host_name   apache02
   first_notification  1
   last_notification   5
   notification_interval   60
   escalation_period   Office_Hours
   contact_groups  unix-sms, dba-email, dev-email
}

define serviceescalation {
   service_description http_80
   host_name   apache02
   first_notification  6
   last_notification   8
   notification_interval   90
   escalation_period   Office_Hours
   contact_groups  unix-sms, dba-email, dev-email, unix-supervisor, dev-supervisor
}

define serviceescalation {
   service_description http_80
   host_name  apache02
   first_notification  1
   last_notification   0
   notification_interval   60
   escalation_period   24x7
   contact_groups  unix-admins-email
}

The users defined in the service_escalation.cfg have their contacts.cfg
configured like this:

define contact{
   contact_name                    unix-sms
   alias                           Team UNIX
   host_notification_period        Early_Morning
   service_notification_period     Early_Morning
   host_notification_options       u,d,r
   service_notification_options    w,c,u,r
   host_notification_commands      host-notify-by-epager
   service_notification_commands   notify-by-epager
   email                           u...@email.org
}

define contact{
   contact_name                    unix-supervisor
   alias                           Team UNIX Supervisor
   host_notification_period        Early_Morning
   service_notification_period     Early_Morning
   host_notification_options       u,d,r
   service_notification_options    w,c,u,r
   host_notification_commands      host-notify-by-epager
   service_notification_commands   notify-by-epager
   email                           unixsupervi...@email.org
}

timeperiod.cfg looks like this:

define timeperiod{
   timeperiod_name Office_Hours
   alias           Office_Hours
   sunday          09:00-20:00
   monday          09:00-20:00
   tuesday         09:00-20:00
   wednesday       09:00-20:00
   thursday        09:00-20:00
   friday          09:00-20:00
   saturday        09:00-20:00
}

define timeperiod{
   timeperiod_name Early_Morning
   alias           Early_Morning
   sunday          07:00-22:10
   monday          07:00-22:10
   tuesday         07:00-22:10
   wednesday       07:00-22:10
   thursday        07:00-22:10
   friday          07:00-22:10
   saturday        07:00-22:10
}

With these configurations in place, the http_80 service goes down at 10pm every
night (scheduled downtime). I expected that notifications from 10pm onwards
would go *only* to unix-admins-email because of the service_escalation.cfg
file. And they did, at least for the critical notifications.

Now the fun part. The recovery notification was sent to the unix-sms,
dba-email, dev-email, unix-supervisor and dev-supervisor groups at 7:03am,
when the service returned to OK status. That is weird, because the critical
notifications from 10pm to 6am (next day) were sent only to the
unix-admins-email group.

Plus, I read from the Nagios docs that it will not send recovery
notifications to those who did not receive the critical/warning/unknown
notifications in the first place.

So my questions are:
Why did Nagios send the recovery alert to the supervisors, who did not know
that the service was down in the first place because they did not receive
the critical alert?
Did Nagios take their defined timeperiods into consideration when it sent
the recovery alert?
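
To make the expectation above concrete, here is a rough Python sketch of the documented escalation rule: an escalation applies when the notification number falls between first_notification and last_notification (0 meaning "no upper limit") and the current time is inside its escalation_period. This is only a simplified model, not Nagios's actual code. It ignores contact-level notification periods and any special handling of recovery notifications, and the 24x7 range is an assumed definition since that timeperiod is not shown in the post.

```python
from datetime import time

# Daily ranges copied from timeperiod.cfg in the post (same range every day).
PERIODS = {
    "Office_Hours": (time(9, 0), time(20, 0)),
    "Early_Morning": (time(7, 0), time(22, 10)),
    "24x7": (time(0, 0), time(23, 59, 59)),  # assumption: not shown in the post
}

# The three serviceescalation entries for http_80 on apache02.
ESCALATIONS = [
    {"first": 1, "last": 5, "period": "Office_Hours",
     "groups": ["unix-sms", "dba-email", "dev-email"]},
    {"first": 6, "last": 8, "period": "Office_Hours",
     "groups": ["unix-sms", "dba-email", "dev-email",
                "unix-supervisor", "dev-supervisor"]},
    {"first": 1, "last": 0, "period": "24x7",
     "groups": ["unix-admins-email"]},
]


def in_period(name, t):
    """True if clock time t falls inside the named daily range."""
    start, end = PERIODS[name]
    return start <= t <= end


def matching_groups(notification_number, t):
    """Contact groups of every escalation that covers this number and time."""
    groups = []
    for esc in ESCALATIONS:
        in_range = esc["first"] <= notification_number and (
            esc["last"] == 0 or notification_number <= esc["last"])
        if in_range and in_period(esc["period"], t):
            for group in esc["groups"]:
                if group not in groups:
                    groups.append(group)
    return groups


print(matching_groups(1, time(22, 0)))   # first problem notification at 10pm
print(in_period("Office_Hours", time(7, 3)),
      in_period("Early_Morning", time(7, 3)))  # the 7:03am recovery
```

Under this simplified model, 7:03am is inside Early_Morning (the contacts' notification period) but outside Office_Hours (the escalation period of the first two entries), which is why the recovery going to those groups looks surprising.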

TIA!
--
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null

Re: [Nagios-users] What the best way to monitor Windows?

2011-08-04 Thread Rai Ricafrente

> I've got a Nagios installation all up and running and working just fine.
> Now I have to start monitoring some Windows Servers.
>
> There are so many different plugins. What are the recommendations for doing
> this?
>
> We have some Win Server 2003, 2008 as well as Exchange and SQL.
>
> Thanks


NSClient++ worked fine for us.

Re: [Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3

2011-06-27 Thread Rai Ricafrente
I finally figured this one out. The plugin was throwing the out of bounds
error because of an underlying performance issue on the server. When the
server becomes slow to respond, Nagios reports this error. The hint provided
by Sven confirmed that disk I/O was indeed what threw everything out of
balance, possibly because the plugin was terminated before it could finish
what it was doing. Replacing the disk with a faster one solved the issue.
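
For anyone hitting the same message, a small Python sketch of how one might read such a return code. It assumes the check command was run through a shell, where an exit status above 128 conventionally means "terminated by signal (status - 128)"; on Linux, 141 - 128 = 13, which is SIGPIPE. The function name is made up for illustration, and this is one possible reading of the number rather than a confirmed diagnosis of what happened here.

```python
import signal

# Plugin return codes that Nagios maps to service states.
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_return_code(code):
    """Describe a plugin exit status (hypothetical helper, not a Nagios API)."""
    if code in SERVICE_STATES:
        return SERVICE_STATES[code]
    if 128 < code < 256:
        signum = code - 128  # shell convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"out of bounds (killed by signal {signum}, {name})"
    return "out of bounds"


print(describe_return_code(141))
```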

Case closed. Thanks to all who helped!

Rai

[Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3

2011-06-19 Thread Rai Ricafrente
Hi everyone,

I just installed a fresh Nagios v3.2.3 with about 150 hosts and 600
services. From time to time, hosts show a "Return code of 141 is out of
bounds" status, which eventually goes away on its own. I don't know if this
has anything to do with the plugin, since the status returns to OK without
intervention, which suggests that the check_icmp plugin works just fine.

I'm confused by this error, and it did not show up when we were using
Nagios v2. Has anyone seen the same issue?

Big thanks,

Rai

Re: [Nagios-users] Nagios Issue of not detecting down/up status of server and delay of mails notifications

2011-06-19 Thread Rai Ricafrente
Have you checked your time periods?

On Fri, Jun 17, 2011 at 10:42 PM, Manish Kumar manikuma...@gmail.com wrote:


> Hi Friends,
>
> I have implemented Nagios Core 3.2.3 on Fedora Core 14 and configured it
> to monitor around 230 network devices for different services like
> up/down status, uptime and port link status. It is similarly configured to
> monitor around 30 Windows servers for different services.
>
> The problem is that Nagios is not able to send e-mail notifications as soon
> as any server/service/network device goes down. Some of the e-mails are
> delayed by around 12 hours and some are not even triggered. For example, if a
> server goes down for around an hour and is up again after that, Nagios is not
> able to detect it. I have set up Fedora 14 to use its sendmail server to
> relay the mails to our corporate SMTP server.
>
> Is this a valid issue with Nagios, or is there any way to scale it up? How
> can a network/server admin rely on it if it works like this?
>
> --
> Thanks
> Manish Kumar
> http://in.linkedin.com/in/manishkumar85
> http://cens.cdac.in/



Re: [Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3

2011-06-19 Thread Rai Ricafrente
The plugin returns OK status when run manually. The error seems to occur
at random times but, as mentioned, eventually goes away. If the plugin were
the issue, I would expect the error to be persistent. In my case, it happens
only from time to time. I have only experienced this since we moved to Nagios
3.2.3; it never happened in Nagios v2.6.



On Mon, Jun 20, 2011 at 10:16 AM, Yueh-Hung Liu yuehung@gmail.com wrote:

> nagios only accepts integers 0~3 as return codes of plugins.
> try to manually execute the command of the questioned service (as the
> user nagios runs as) and check the outputs.



Re: [Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3

2011-06-19 Thread Rai Ricafrente
Hi Allan,

I'm just saying that this could be a bug in the current Nagios, or the
plugin as Terry pointed out, since this was never present in the previous
version that I was using. This really freaks me out.

I enabled Nagios's debug mode in the hope of learning more about this. I am
also considering recompiling the whole thing. There is a nasty little
bugger out there.

Thanks!

[Nagios-users] current_state value in status.dat file

2011-03-17 Thread Rai Ricafrente
I am looking at the status.dat file and I noticed that current_state is
=1 even when the host is down and in a critical state. I assumed that if a
host is down, current_state should be =2. I am not sure how Nagios sets
current_state, but in my case the host is definitely off:

admin1@serverr1:~ ping userver02
PING userver02.localhost (192.168.0.11) 56(84) bytes of data.

--- userver02.localhost ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6012ms

So why does the status.dat file say that current_state=1? Does it mean
that the host is only in a Warning state, and not Critical as shown in the
web console? Anyway, I am bothered by this because I am using the
check_summary.pl plugin
(http://exchange.nagios.org/directory/Plugins/Network-and-Systems-Management/Nagios/check_summary/details),
which reads the status.dat file.

Btw, the status.dat file looks like this:

hoststatus {
host_name=userver02
modified_attributes=0
check_command=check-host-alive
check_period=24x7
notification_period=24x7
check_interval=5.00
retry_interval=1.00
event_handler=
has_been_checked=1
should_be_scheduled=1
check_execution_time=2.114
check_latency=0.239
check_type=0
current_state=1
last_hard_state=1
last_event_id=505
current_event_id=509
current_problem_id=232
last_problem_id=0
plugin_output=CRITICAL - userver02.localhost: rta nan, lost 100%
long_plugin_output=
performance_data=rta=0.000ms;800.000;1000.000;0; pl=100%;80;100;; rtmax=0.000ms rtmin=0.000ms
last_check=1300344149
next_check=1300344459
check_options=0
current_attempt=1
max_attempts=2
current_event_id=509
last_event_id=505
state_type=1
last_state_change=1299383163
last_hard_state_change=1299383163
last_time_up=0
last_time_down=1300344159
last_time_unreachable=0
last_notification=1300344079
next_notification=1300347679
no_more_notifications=0
current_notification_number=267
current_notification_id=2857
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
failure_prediction_enabled=1
process_performance_data=1
obsess_over_host=1
last_update=1300344223
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
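
For reference, a minimal Python sketch of how a script might read blocks like the one above out of status.dat. It assumes the layout shown in this post (a "hoststatus {" line, key=value lines, then a closing "}"); the function name and the shortened sample are illustrative only, and this is not how check_summary.pl itself is implemented.

```python
def parse_status_blocks(text, block_type="hoststatus"):
    """Return one dict per '<block_type> { ... }' block found in text."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == f"{block_type} {{":
            current = {}                     # start collecting a new block
        elif line == "}":
            if current is not None:
                blocks.append(current)       # block finished
            current = None
        elif current is not None and "=" in line:
            # Split on the first '=' only; values may contain '=' themselves.
            key, _, value = line.partition("=")
            current[key] = value
    return blocks


# Shortened copy of the block from the post.
SAMPLE = """\
hoststatus {
host_name=userver02
current_state=1
plugin_output=CRITICAL - userver02.localhost: rta nan, lost 100%
performance_data=rta=0.000ms;800.000;1000.000;0; pl=100%;80;100;;
}
"""

hosts = parse_status_blocks(SAMPLE)
print(hosts[0]["host_name"], hosts[0]["current_state"])
```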

Re: [Nagios-users] current_state value in status.dat file

2011-03-17 Thread Rai Ricafrente
Thanks for pointing me to the doc. I had been going through it, and I guess I
was looking for the part where it says that DOWN = 1 and not 2.
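
To spell out the difference that tripped me up, here is a small Python sketch of the two numbering schemes: hosts use 0/1/2 for UP/DOWN/UNREACHABLE, while services use 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The helper name is made up for illustration; a script reading status.dat would need to pick the table based on whether the value came from a hoststatus or a servicestatus block.

```python
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def state_name(current_state, is_host):
    """Translate a current_state value using the host or service table."""
    table = HOST_STATES if is_host else SERVICE_STATES
    return table.get(int(current_state), "INVALID")


print(state_name("1", is_host=True))    # value from a hoststatus block
print(state_name("1", is_host=False))   # same number in a servicestatus block
```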


On Thu, Mar 17, 2011 at 3:40 PM, Giacomo Montagner gmontag...@sorint.it wrote:

> On Thu, 17 Mar 2011 14:57:08 +0800
> Rai Ricafrente maill...@ricafrente.com wrote:
>
> > I am looking at the status.dat file and I noticed that the current_state is
> > =1 even when the host is down and is in critical state. I assume that if a
> > host is down, the current_state should be =2.
>
> It's probably because hosts can be in only 3 states:
>
> UP=0
> DOWN=1
> UNREACHABLE=2 (http://nagios.sourceforge.net/docs/3_0/networkreachability.html)
>
> Giacomo



