Hi,

We have a Nagios server that monitors around 300 production servers and more than 2000 services across them. Recently, the state of one of the services on a particular host turned HARD, but Nagios did not send a notification, and I am trying to understand why. Here is more information about the configuration:

define service {
    service_description    MAILQ_1K_2K
    host_name              server-name
    use                    generic-service
    check_command          check_mailq_snmp!1000!2000
    contact_groups         cg_server-name
}

define contactgroup {
    contactgroup_name      cg_server-name
    alias                  server-name Contact Group
    members                team_emailpage-24x7
}

define contact {
    contact_name                     team_emailpage-24x7
    alias                            team_emailpage-24x7
    service_notification_period      24x7
    host_notification_period         24x7
    service_notification_options     c,r
    host_notification_options        d,r
    service_notification_commands    notify-by-page,notify-by-email
    host_notification_commands       host-notify-by-page,host-notify-by-email
    email                            email-address
    pager                            team
}

The following are the relevant options defined under "generic-service":

    check_period            24x7
    normal_check_interval   5
    retry_check_interval    2
    max_check_attempts      5
    notification_period     24x7

And the following are the corresponding logs from when the service went down:

Oct 11 02:19:50 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;1;mailq is 1358
Oct 11 02:22:49 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;2;mailq is 1537
Oct 11 02:26:05 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;3;mailq is 1799
Oct 11 02:28:59 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;4;mailq is 1799
Oct 11 02:36:53 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;HARD;5;mailq is 2133

I modified the server names. The warning threshold is 1000 and the critical threshold is 2000. Roughly 45 minutes later the service recovered, but Nagios did not send any notification for this service during the whole period (I mean, until it came back to the OK state).
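
For anyone retracing the alert lines above, here is a minimal Python sketch that splits them into fields and picks out the HARD state changes. It is not part of the setup described in this post. The names ServiceAlert, parse_alert and contact_wants are made up for illustration. The option check looks only at the contact-level service_notification_options quoted above (c,r) and does not model any of the other notification filters Nagios applies, such as the service-level options, which are not shown here.

```python
import re
from typing import NamedTuple, Optional


class ServiceAlert(NamedTuple):
    """One SERVICE ALERT syslog entry (hypothetical helper type)."""
    timestamp: str
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


# Syslog prefix followed by the semicolon-separated SERVICE ALERT body.
ALERT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+nagios:\s+"
    r"SERVICE ALERT:\s+(?P<body>.*)$"
)

# Assumed mapping from logged service state to notification option letter.
STATE_TO_OPTION = {"WARNING": "w", "CRITICAL": "c", "UNKNOWN": "u", "OK": "r"}


def parse_alert(line: str) -> Optional[ServiceAlert]:
    """Parse one log line; return None if it is not a SERVICE ALERT."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    parts = match.group("body").split(";", 5)
    if len(parts) != 6:
        return None
    host, service, state, state_type, attempt, output = parts
    return ServiceAlert(match.group("ts"), host, service, state,
                        state_type, int(attempt), output)


def contact_wants(state: str, options: set) -> bool:
    """Simplified check of the contact-level option letters only."""
    return STATE_TO_OPTION.get(state) in options


LOG = """\
Oct 11 02:19:50 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;1;mailq is 1358
Oct 11 02:22:49 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;2;mailq is 1537
Oct 11 02:26:05 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;3;mailq is 1799
Oct 11 02:28:59 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;4;mailq is 1799
Oct 11 02:36:53 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;HARD;5;mailq is 2133
"""

alerts = [a for a in map(parse_alert, LOG.splitlines()) if a is not None]
hard_alerts = [a for a in alerts if a.state_type == "HARD"]
contact_options = {"c", "r"}  # service_notification_options from the contact

for alert in hard_alerts:
    print(alert.timestamp, alert.state, alert.attempt,
          contact_wants(alert.state, contact_options))
```

Under those assumptions the only HARD entry is the CRITICAL one at attempt 5, and the contact's c,r options would accept it. That only rules out the contact-level options as the filter; it says nothing about the rest of the notification logic.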

Nagios logs from when this service came back:

Oct 11 03:20:20 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;SOFT;1;mailq is 2968
Oct 11 03:22:17 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;SOFT;2;mailq is 2968
Oct 11 03:24:17 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;SOFT;3;mailq is 2968
Oct 11 03:26:18 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;OK;SOFT;4;mailq is 411

More info: From the Nagios documentation, I understand that Nagios performs on-demand host checks when a service changes state. So my guess is that Nagios performed a host check when the service turned HARD (which happened at the same time as the change from WARNING to CRITICAL). I see a lot of log entries for other services after this service turned HARD, but I would still have expected a notification for this particular service. Thoughts?

Nagios version: 3.1.2
OS: Debian 4.0 (Etch)

Thanks in advance,
Satish