Re: [Nagios-users] Problem with check_openmanage plugin and storage

2013-06-19 Thread Nic Bernstein
On 06/18/2013 07:55 PM, nagios-users-requ...@lists.sourceforge.net wrote:
 Date: Wed, 19 Jun 2013 02:41:02 +0200
 From: Trond Hasle Amundsen t.h.amund...@usit.uio.no
 Subject: Re: [Nagios-users] Problem with check_openmanage plugin and storage
 To: Nagios Users List nagios-users@lists.sourceforge.net
 Message-ID: 15tk3lrrkyp@tux.uio.no
 Content-Type: text/plain; charset=utf-8

 Nic Bernstein n...@onlight.com writes:
  We've recently been experimenting with Trond Hasle Amundsen's check_openmanage
  on a large network with about a hundred Dell servers of various ages,
  capabilities, etc.  Mostly PE-2950, R210, R410 and R720.  Much thanks to Trond
  for all his great work on Nagios plugins and other projects, by the way.

  We've hit a wall, however, with the storage monitoring aspects of this plugin.

  For example, here's a quite specific case.  This is a new PE R720, in debug:
 
  onlight@monitor:~$ check_openmanage -H host -C secret -d
     System:      PowerEdge R720           OMSA version:    7.1.0
     ServiceTag:  ###                      Plugin version:  3.7.9
     BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
  -----------------------------------------------------------------------------
     Storage Components
  =============================================================================
    STATE  |    ID    |  MESSAGE TEXT
  ---------+----------+--------------------------------------------------------
        OK |        0 | Controller 0 [PERC H310 Mini] is Ready
   WARNING |  0:0:1:0 | Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
   WARNING |  0:0:1:1 | Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
        OK |      0:0 | Logical Drive '/dev/sda' [RAID-1, 1862.50 GB] is Ready
        OK |      0:0 | Connector 0 [SAS] on controller 0 is Ready
        OK |      0:1 | Connector 1 [SAS] on controller 0 is Ready
        OK |    0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
 [...]
  This run exits with 1 (WARNING).
 
  We're not sure we agree with the decision to make the fact that a disk is not
  Dell Certified a Warning, but we can at least understand that.  So, what if we
  exclude storage, with --no-storage?
 The decision to create a warning for non-certified disks belongs to
 Dell. I've tried to let the plugin simply relay the warning level from
 Openmanage, unless it's outright wrong (such as reporting disks in
 predictive failure as OK).

Yes, we completely understand that, and the use of the global status
flag.  I should have been clearer that we get that it wasn't your choice.

  onlight@monitor:~$ check_openmanage -H host -C secret -d --no-storage
     System:      PowerEdge R720           OMSA version:    7.1.0
     ServiceTag:  ###                      Plugin version:  3.7.9
     BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
  -----------------------------------------------------------------------------
 
[...]
  OOPS! Something is wrong with this server, but I don't know what. The global
  system health status is WARNING, but every component check is OK. This may
  be a bug in the Nagios plugin, please file a bug report.
 
  This yields exit code 3 (UNKNOWN).
 This is a bug. Using blacklisting or check manipulation (such as
 --no-storage) should disable the global health check.

Okay, that's what we'd expect.
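
For illustration, the intended behaviour could be sketched roughly as follows. This is Python rather than the plugin's own Perl, and the function and parameter names are hypothetical, not taken from check_openmanage. It only shows the idea that the global health status should be consulted when no blacklisting or check manipulation is in effect.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def final_state(component_states, global_state, checks_restricted):
    """Combine per-component results with the global health status.

    Hypothetical helper, not the plugin's real code.
    component_states: exit codes (OK, WARNING or CRITICAL) from the
        component checks that actually ran.
    global_state: exit code derived from the global system health status.
    checks_restricted: True if blacklisting or an option such as
        --no-storage removed any component from the check.
    """
    # CRITICAL outranks WARNING, which outranks OK.
    worst = max(component_states, default=OK)

    if checks_restricted:
        # Some components were deliberately skipped, so the global status
        # may reflect a problem we were told to ignore: do not consult it.
        return worst

    if worst == OK and global_state != OK:
        # The global status disagrees with every component check:
        # the "OOPS! Something is wrong" case, reported as UNKNOWN.
        return UNKNOWN

    return worst
```

Under this sketch, an R720 whose only problem is in the excluded storage checks would return OK instead of UNKNOWN once `checks_restricted` is set.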

  Now, just for argument's sake, let's say we obviate the check for certified
  drives, by commenting out the workaround for OMSA 7.1.0 bug code (just
  a handy little short-cut).  Here's what we get then:
 
 [...]
  Again, as with the original case, exit code is 1 (WARNING).
 
  Is there any way around this?  Should I be disabling global health checks?
 Openmanage contains a bug that flips the reported warning level
 wrt. certified disks. Any certified disks are reported as non-certified
 and vice versa. The output above is expected when you remove the
 workaround in the code.
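
To illustrate the inversion described above, here is a small Python sketch. It assumes, hypothetically, that the certification flag arrives as a boolean and that only OMSA 7.1.0 is affected. The names are invented for this example and the real plugin logic may well differ.

```python
def disk_is_certified(reported_certified, omsa_version):
    """Return the corrected certification status of a physical disk.

    Hypothetical helper: 'reported_certified' is the flag as reported by
    OpenManage, 'omsa_version' is a version string such as "7.1.0".
    """
    # Assumption: only OMSA 7.1.0 reports the flag inverted.
    if omsa_version == "7.1.0":
        return not reported_certified
    return reported_certified
```

With the workaround removed, the raw flag from OpenManage would be passed through unchanged.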

  Here's a run to test that, and it works:
 
  onlight@monitor:~$ check_openmanage -H host -C secret -b pdisk=all
  OK - System: 'PowerEdge R720', SN: '###', 16 GB ram (4 dimms), 1 logical drives, 2 physical drives
 Here, the physical disks aren't checked at all, and the global check is
 correctly disabled, so this is an expected result.

  Interestingly, when combining the blacklist with debug (-d -b pdisk=all), the
  exit code is 3 (UNKNOWN), but with debug off, it's 0 (OK).
 Sounds like a bug, perhaps related to the one discussed earlier.
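
A discrepancy like this can be reproduced side by side with a small wrapper. The following is a sketch only: the helper names are made up for this example, and the only thing it relies on is the standard Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Standard Nagios plugin exit codes.
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv):
    """Run a plugin command line and return (exit_code, state_name).

    Exit codes outside 0-3 are labelled UNKNOWN here.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, STATE_NAMES.get(result.returncode, "UNKNOWN")


def compare_runs(base_argv, extra_flag="-d"):
    """Run the same check without and with one extra flag.

    Returns a pair of (exit_code, state_name) tuples so that a difference,
    such as 0 (OK) versus 3 (UNKNOWN), is easy to spot.
    """
    return run_check(base_argv), run_check(base_argv + [extra_flag])
```

For the case in this thread, the call would be something like `compare_runs(["check_openmanage", "-H", "host", "-C", "secret", "-b", "pdisk=all"])`, with `host` and `secret` being the same placeholders used in the transcripts.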

  So, I guess what I'm wondering is why we need to blacklist the physical disks
  (pdisk) instead of using --no-storage?  Shouldn't --no-storage also cause
  globalstatus to be ignored?
 Yes it should, I'll look into that, thanks

Re: [Nagios-users] Problem with check_openmanage plugin and storage

2013-06-19 Thread Trond Hasle Amundsen
Nic Bernstein n...@onlight.com writes:

 Regarding the non-certified disks problem... There is a special
 blacklisting keyword to suppress the message about non-certified disks:

   check_openmanage -b pdisk_cert=all

 Please try this and see if it resolves your issue. Using blacklisting
 should also disable the global health check.


 Ah, that's just what we need.  Much appreciated...

 No, that doesn't seem to be in my version (3.7.9, downloaded yesterday)

 onlight@monitor:~$ perl check_openmanage -H host -C secret -b pdisk_cert=all
 Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
 Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
 onlight@monitor:~$ echo $?
 1

 I guess I'll wait for a patch.

Are you sure you didn't test this with the 7.1.0 workaround manually
removed?

 Say Trond, I sent you some notes last week about enhancements we made to your
 check_linux_bonding plugin.  Would you prefer I re-post those to the list
 instead?

Sorry for being unresponsive of late. I've been swamped at work and have
built up something of an email backlog. No need to resend :)

Regards,
-- 
Trond H. Amundsen t.h.amund...@usit.uio.no
Center for Information Technology Services, University of Oslo



Re: [Nagios-users] Problem with check_openmanage plugin and storage

2013-06-18 Thread Trond Hasle Amundsen
Nic Bernstein n...@onlight.com writes:

 We've recently been experimenting with Trond Hasle Amundsen's check_openmanage
 on a large network with about a hundred Dell servers of various ages,
 capabilities, etc.  Mostly PE-2950, R210, R410 and R720.  Much thanks to Trond
 for all his great work on Nagios plugins and other projects, by the way.

 We've hit a wall, however, with the storage monitoring aspects of this plugin.

 For example, here's a quite specific case.  This is a new PE R720, in debug:

 onlight@monitor:~$ check_openmanage -H host -C secret -d
    System:      PowerEdge R720           OMSA version:    7.1.0
    ServiceTag:  ###                      Plugin version:  3.7.9
    BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
 -----------------------------------------------------------------------------
    Storage Components
 =============================================================================
   STATE  |    ID    |  MESSAGE TEXT
 ---------+----------+--------------------------------------------------------
       OK |        0 | Controller 0 [PERC H310 Mini] is Ready
  WARNING |  0:0:1:0 | Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
  WARNING |  0:0:1:1 | Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
       OK |      0:0 | Logical Drive '/dev/sda' [RAID-1, 1862.50 GB] is Ready
       OK |      0:0 | Connector 0 [SAS] on controller 0 is Ready
       OK |      0:1 | Connector 1 [SAS] on controller 0 is Ready
       OK |    0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
 
 -----------------------------------------------------------------------------
    Chassis Components
 =============================================================================
   STATE  |  ID  |  MESSAGE TEXT
 ---------+------+------------------------------------------------------------
       OK |    0 | Memory module 0 [DIMM_A1, 4096 MB] is Ok
       OK |    1 | Memory module 1 [DIMM_A2, 4096 MB] is Ok
       OK |    2 | Memory module 2 [DIMM_A3, 4096 MB] is Ok
       OK |    3 | Memory module 3 [DIMM_A4, 4096 MB] is Ok
       OK |    0 | Chassis fan 0 [System Board Fan1 RPM] reading: 1200 RPM
       OK |    1 | Chassis fan 1 [System Board Fan2 RPM] reading: 1080 RPM
       OK |    2 | Chassis fan 2 [System Board Fan3 RPM] reading: 1200 RPM
       OK |    3 | Chassis fan 3 [System Board Fan4 RPM] reading: 1080 RPM
       OK |    4 | Chassis fan 4 [System Board Fan5 RPM] reading: 1080 RPM
       OK |    5 | Chassis fan 5 [System Board Fan6 RPM] reading: 1080 RPM
       OK |    0 | Power Supply 0 [AC]: Presence detected
       OK |    0 | Temperature Probe 0 [System Board Inlet Temp] reads 26 C (min=3/-7, max=42/47)
       OK |    1 | Temperature Probe 1 [System Board Exhaust Temp] reads 33 C (min=8/3, max=70/75)
       OK |    2 | Temperature Probe 2 [CPU1 Temp] reads 49 C (min=8/3, max=83/88)
       OK |    0 | Processor 0 [Intel Xeon E5-2603 0 1.80GHz] is Present
       OK |    0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
       OK |    1 | Voltage sensor 1 [System Board 3.3V PG] is Good
       OK |    2 | Voltage sensor 2 [System Board 5V PG] is Good
       OK |    3 | Voltage sensor 3 [CPU1 PLL PG] is Good
       OK |    4 | Voltage sensor 4 [System Board 1.1V PG] is Good
       OK |    5 | Voltage sensor 5 [CPU1 M23 VDDQ PG] is Good
       OK |    6 | Voltage sensor 6 [CPU1 M23 VTT PG] is Good
       OK |    7 | Voltage sensor 7 [System Board FETDRV PG] is Good
       OK |    8 | Voltage sensor 8 [CPU1 VSA PG] is Good
       OK |    9 | Voltage sensor 9 [CPU1 M01 VDDQ PG] is Good
       OK |   10 | Voltage sensor 10 [System Board NDC PG] is Good
       OK |   11 | Voltage sensor 11 [CPU1 VTT PG] is Good
       OK |   12 | Voltage sensor 12 [System Board 1.5V PG] is Good
       OK |   13 | Voltage sensor 13 [PS2 PG Fail] is Good
       OK |   14 | Voltage sensor 14 [System Board PS1 PG Fail] is Good
       OK |   15 | Voltage sensor 15 [System Board BP1 5V PG] is Good
       OK |   16 | Voltage sensor 16 [CPU1 M01 VTT PG] is Good
       OK |   17 | Voltage sensor 17 [PS1 Voltage 1] reads 114 V
       OK |    0 | Battery probe 0 [System Board CMOS Battery] is Presence Detected
       OK |    0 | Amperage probe 0 [PS1 Current 1] reads 0.6 A
       OK |    1 | Amperage probe 1 [System Board Pwr Consumption] reads 56 W
       OK |    0 | Chassis intrusion 0 detection: Ok (Not Breached)
       OK |    0 | SD Card 0 [vFlash] is Absent
 
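 -----------------------------------------------------------------------------
    Other messages
 =============================================================================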
   STATE  |