Re: [Nagios-users] Problem with check_openmanage plugin and storage
On 06/18/2013 07:55 PM, nagios-users-requ...@lists.sourceforge.net wrote:

Date: Wed, 19 Jun 2013 02:41:02 +0200
From: Trond Hasle Amundsen t.h.amund...@usit.uio.no
Subject: Re: [Nagios-users] Problem with check_openmanage plugin and storage
To: Nagios Users List nagios-users@lists.sourceforge.net
Message-ID: 15tk3lrrkyp@tux.uio.no
Content-Type: text/plain; charset=utf-8

Nic Bernstein n...@onlight.com writes:

We've recently been experimenting with Trond Hasle Amundsen's check_openmanage on a large network with about a hundred Dell servers of various ages, capabilities, etc. Mostly PE-2950, R210, R410 and R720. Much thanks to Trond for all his great work on Nagios plugins and other projects, by the way.

We've hit a wall, however, with the storage monitoring aspects of this plugin. For example, here's a quite specific case. This is a new PE R720, in debug:

onlight@monitor:~$ check_openmanage -H host -C secret -d
System:      PowerEdge R720       OMSA version:    7.1.0
ServiceTag:  ###                  Plugin version:  3.7.9
BIOS/date:   1.2.6 05/10/2012     Checking mode:   SNMPv2c UDP/IPv4
------------------------------------------------------------------------
Storage Components
========================================================================
   STATE |      ID | MESSAGE TEXT
---------+---------+----------------------------------------------------
      OK |       0 | Controller 0 [PERC H310 Mini] is Ready
 WARNING | 0:0:1:0 | Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
 WARNING | 0:0:1:1 | Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
      OK |     0:0 | Logical Drive '/dev/sda' [RAID-1, 1862.50 GB] is Ready
      OK |     0:0 | Connector 0 [SAS] on controller 0 is Ready
      OK |     0:1 | Connector 1 [SAS] on controller 0 is Ready
      OK |   0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
[...]

This run exits with 1 (WARNING). We're not sure we agree with the decision to make the fact that a disk is not Dell Certified a Warning, but we can at least understand that. So, what if we exclude storage, with --no-storage?

The decision to create a warning for non-certified disks belongs to Dell.
I've tried to let the plugin simply relay the warning level from Openmanage, unless it's outright wrong (such as reporting disks in predictive failure as OK).

Yes, we completely understand that, and the use of the global status flag. I should have been clearer that we get that it wasn't your choice.

onlight@monitor:~$ check_openmanage -H host -C secret -d --no-storage
System:      PowerEdge R720       OMSA version:    7.1.0
ServiceTag:  ###                  Plugin version:  3.7.9
BIOS/date:   1.2.6 05/10/2012     Checking mode:   SNMPv2c UDP/IPv4
------------------------------------------------------------------------
[...]
OOPS! Something is wrong with this server, but I don't know what. The global system health status is WARNING, but every component check is OK. This may be a bug in the Nagios plugin, please file a bug report.

This yields exit code 3 (UNKNOWN).

This is a bug. Using blacklisting or check manipulation (such as --no-storage) should disable the global health check.

Okay, that's what we'd expect.

Now, just for argument's sake, let's say we obviate the check for certified drives, by commenting out the workaround for OMSA 7.1.0 bug code (just a handy little short-cut). Here's what we get then:

[...]

Again, as with the original case, exit code is 1 (WARNING). Is there any way around this? Should I be disabling global health checks?

Openmanage contains a bug that flips the reported warning level wrt. certified disks. Any certified disks are reported as non-certified and vice versa. The output above is expected when you remove the workaround in the code.

Here's a run to test that, and it works:

onlight@monitor:~$ check_openmanage -H host -C secret -b pdisk=all
OK - System: 'PowerEdge R720', SN: '###', 16 GB ram (4 dimms), 1 logical drives, 2 physical drives

Here, the physical disks aren't checked at all, and the global check is correctly disabled, so this is an expected result.

Interestingly, when combining the blacklist with debug (-d -b pdisk=all), the exit code is 3 (UNKNOWN), but with debug off, it's 0 (OK).
Sounds like a bug, perhaps related to the one discussed earlier.

So, I guess what I'm wondering is why we need to blacklist the physical disks (pdisk) instead of using --no-storage? Shouldn't --no-storage also cause globalstatus to be ignored?

Yes it should, I'll look into that, thanks
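
The behaviour described in this message (a non-OK component decides the result, the global health status is ignored once blacklisting or check manipulation is in use, and a non-OK global status with all-OK components gives the "OOPS!" case) can be sketched roughly as below. This is a hypothetical Python illustration and not the plugin's code, which is written in Perl; the function name `overall_state` and its three parameters are assumptions made for the example.

```python
# Hypothetical sketch of the expected behaviour; not taken from check_openmanage.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def overall_state(component_states, global_state, checks_modified):
    """Return the exit code for one run.

    component_states: exit codes of the component checks that actually ran
    global_state:     the global health status reported by OpenManage
    checks_modified:  True if blacklisting or an option such as --no-storage was used
    """
    # CRITICAL outranks WARNING; any non-OK component decides the result.
    # (Simplification: UNKNOWN component states are not considered here.)
    worst = max(
        (s for s in component_states if s in (WARNING, CRITICAL)), default=OK
    )
    if worst != OK:
        return worst
    # With checks excluded, a non-OK global status may come from a component
    # that was skipped on purpose, so the global status is ignored.
    if checks_modified:
        return OK
    # Every component is OK but the global status is not: the "OOPS!" case.
    if global_state != OK:
        return UNKNOWN
    return OK
```

Under these assumptions, the --no-storage run on the R720 would come out as 0 (OK) instead of 3 (UNKNOWN), which is what the -b pdisk=all run without debug already does.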
Re: [Nagios-users] Problem with check_openmanage plugin and storage
Nic Bernstein n...@onlight.com writes:

Regarding the non-certified disks problem... There is a special blacklisting keyword to suppress the message about non-certified disks:

check_openmanage -b pdisk_cert=all

Please try this and see if it resolves your issue. Using blacklisting should also disable the global health check.

Ah, that's just what we need. Much appreciated...

No, that doesn't seem to be in my version (3.7.9, downloaded yesterday)

onlight@monitor:~$ perl check_openmanage -H host -C secret -b pdisk_cert=all
Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
onlight@monitor:~$ echo $?
1

I guess I'll wait for a patch.

Are you sure you didn't test this with the 7.1.0 workaround manually removed?

Say Trond, I sent you some notes last week about enhancements we made to your check_linux_bonding plugin. Would you prefer I re-post those to the list instead?

Sorry for being non-responsive of late. I've been swamped at work lately and have attained somewhat of an email backlog. No need to resend :)

Regards,
--
Trond H. Amundsen t.h.amund...@usit.uio.no
Center for Information Technology Services, University of Oslo
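
The `echo $?` check in the exchange above reads the plugin's exit status, which follows the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). When comparing several option combinations, a small wrapper along the following lines may be handy. It is only a sketch: `run_check` and `STATE_NAMES` are names invented for this example, and it assumes Python 3.7 or later.

```python
import subprocess

# Standard Nagios plugin exit codes.
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv):
    """Run a plugin command line and return (state name, first line of output)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    lines = result.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Exit codes outside 0-3 are reported as UNKNOWN; a choice made for this sketch.
    return STATE_NAMES.get(result.returncode, "UNKNOWN"), first_line
```

For instance, `run_check(["check_openmanage", "-H", "host", "-C", "secret", "-b", "pdisk_cert=all"])` would return the state name together with the first output line, with the host and community string being the same placeholders used in the thread.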
Re: [Nagios-users] Problem with check_openmanage plugin and storage
Nic Bernstein n...@onlight.com writes:

We've recently been experimenting with Trond Hasle Amundsen's check_openmanage on a large network with about a hundred Dell servers of various ages, capabilities, etc. Mostly PE-2950, R210, R410 and R720. Much thanks to Trond for all his great work on Nagios plugins and other projects, by the way.

We've hit a wall, however, with the storage monitoring aspects of this plugin. For example, here's a quite specific case. This is a new PE R720, in debug:

onlight@monitor:~$ check_openmanage -H host -C secret -d
System:      PowerEdge R720       OMSA version:    7.1.0
ServiceTag:  ###                  Plugin version:  3.7.9
BIOS/date:   1.2.6 05/10/2012     Checking mode:   SNMPv2c UDP/IPv4
------------------------------------------------------------------------
Storage Components
========================================================================
   STATE |      ID | MESSAGE TEXT
---------+---------+----------------------------------------------------
      OK |       0 | Controller 0 [PERC H310 Mini] is Ready
 WARNING | 0:0:1:0 | Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
 WARNING | 0:0:1:1 | Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
      OK |     0:0 | Logical Drive '/dev/sda' [RAID-1, 1862.50 GB] is Ready
      OK |     0:0 | Connector 0 [SAS] on controller 0 is Ready
      OK |     0:1 | Connector 1 [SAS] on controller 0 is Ready
      OK |   0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
------------------------------------------------------------------------
Chassis Components
========================================================================
   STATE |      ID | MESSAGE TEXT
---------+---------+----------------------------------------------------
      OK |       0 | Memory module 0 [DIMM_A1, 4096 MB] is Ok
      OK |       1 | Memory module 1 [DIMM_A2, 4096 MB] is Ok
      OK |       2 | Memory module 2 [DIMM_A3, 4096 MB] is Ok
      OK |       3 | Memory module 3 [DIMM_A4, 4096 MB] is Ok
      OK |       0 | Chassis fan 0 [System Board Fan1 RPM] reading: 1200 RPM
      OK |       1 | Chassis fan 1 [System Board Fan2 RPM] reading: 1080 RPM
      OK |       2 | Chassis fan 2 [System Board Fan3 RPM] reading: 1200 RPM
      OK |       3 | Chassis fan 3 [System Board Fan4 RPM] reading: 1080 RPM
      OK |       4 | Chassis fan 4 [System Board Fan5 RPM] reading: 1080 RPM
      OK |       5 | Chassis fan 5 [System Board Fan6 RPM] reading: 1080 RPM
      OK |       0 | Power Supply 0 [AC]: Presence detected
      OK |       0 | Temperature Probe 0 [System Board Inlet Temp] reads 26 C (min=3/-7, max=42/47)
      OK |       1 | Temperature Probe 1 [System Board Exhaust Temp] reads 33 C (min=8/3, max=70/75)
      OK |       2 | Temperature Probe 2 [CPU1 Temp] reads 49 C (min=8/3, max=83/88)
      OK |       0 | Processor 0 [Intel Xeon E5-2603 0 1.80GHz] is Present
      OK |       0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
      OK |       1 | Voltage sensor 1 [System Board 3.3V PG] is Good
      OK |       2 | Voltage sensor 2 [System Board 5V PG] is Good
      OK |       3 | Voltage sensor 3 [CPU1 PLL PG] is Good
      OK |       4 | Voltage sensor 4 [System Board 1.1V PG] is Good
      OK |       5 | Voltage sensor 5 [CPU1 M23 VDDQ PG] is Good
      OK |       6 | Voltage sensor 6 [CPU1 M23 VTT PG] is Good
      OK |       7 | Voltage sensor 7 [System Board FETDRV PG] is Good
      OK |       8 | Voltage sensor 8 [CPU1 VSA PG] is Good
      OK |       9 | Voltage sensor 9 [CPU1 M01 VDDQ PG] is Good
      OK |      10 | Voltage sensor 10 [System Board NDC PG] is Good
      OK |      11 | Voltage sensor 11 [CPU1 VTT PG] is Good
      OK |      12 | Voltage sensor 12 [System Board 1.5V PG] is Good
      OK |      13 | Voltage sensor 13 [PS2 PG Fail] is Good
      OK |      14 | Voltage sensor 14 [System Board PS1 PG Fail] is Good
      OK |      15 | Voltage sensor 15 [System Board BP1 5V PG] is Good
      OK |      16 | Voltage sensor 16 [CPU1 M01 VTT PG] is Good
      OK |      17 | Voltage sensor 17 [PS1 Voltage 1] reads 114 V
      OK |       0 | Battery probe 0 [System Board CMOS Battery] is Presence Detected
      OK |       0 | Amperage probe 0 [PS1 Current 1] reads 0.6 A
      OK |       1 | Amperage probe 1 [System Board Pwr Consumption] reads 56 W
      OK |       0 | Chassis intrusion 0 detection: Ok (Not Breached)
      OK |       0 | SD Card 0 [vFlash] is Absent
------------------------------------------------------------------------
Other messages
========================================================================
   STATE |