Hi John,
comment below...

On Oct 11, 2012, at 3:10 AM, Carsten John <cj...@mpi-bremen.de> wrote:

> Hello everybody,
> 
> I just wanted to share my experience with a (partially) broken SSD that was 
> in use in a ZIL mirror.
> 
> We experienced a dramatic performance problem with one of our zpools, serving 
> home directories. Mainly NFS clients were affected. Our SunRay infrastructure 
> came to a complete halt.
> 
> Finally we were able to identify one SSD as the root cause. The SSD was still 
> working, but quite slow.
> 
> The issue didn't trigger ZFS to detect the disk as faulty. FMA didn't detect 
> it either.
> 
> We identified the broken disk by issuing "iostat -en". After replacing the 
> SSD, everything went back to normal.
> 
> To prevent outages like this in the future I hacked together a "quick and 
> dirty" bash script to detect disks with a given rate of total errors. The 
> script might be used in conjunction with nagios.

This shouldn't be needed. All of the fields of iostat are in kstats, and nagios
can already collect kstats:
        kstat -pm sderr

The good thing about using this method is that it works with or without ZFS.
The bad thing is that some SMART tools and devices trigger complaints that
show up as errors (these can be safely ignored).
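
If you do want a standalone check instead of the nagios kstat collection, a
minimal sketch of the same idea on top of the kstat output might look like the
following. It assumes the usual "kstat -p" line format
(module:instance:name:statistic, a tab, then the value) and the "Soft Errors",
"Hard Errors" and "Transport Errors" statistics of the sderr module. The
function names and the default limits (borrowed from the script quoted below)
are made up for illustration.

```shell
#!/bin/bash
# Sketch only: total the Soft/Hard/Transport error counters per device from
# `kstat -p` output read on stdin. Function names and limits are illustrative.

# Print "<device> <total>" for each device seen in the input.
sderr_totals() {
    awk -F'\t' '
        # Lines look like: sderr:0:sd0,err:Hard Errors<TAB>0
        $1 ~ /:(Soft|Hard|Transport) Errors$/ {
            split($1, key, ":")      # module, instance, name, statistic
            dev = key[3]
            sub(/,err$/, "", dev)    # "sd0,err" -> "sd0"
            total[dev] += $2
        }
        END { for (dev in total) print dev, total[dev] }
    ' | sort
}

# Map the totals to a nagios-style exit code (0 OK, 1 WARNING, 2 CRITICAL).
# $1 = soft limit (default 5), $2 = hard limit (default 20).
sderr_check() {
    local soft=${1:-5} hard=${2:-20}
    sderr_totals | awk -v soft="$soft" -v hard="$hard" '
        BEGIN { rc = 0 }
        $2 > soft { bad = bad " " $1 ":" $2; if (rc < 1) rc = 1 }
        $2 > hard { rc = 2 }
        END {
            if (rc == 2)      print "CRITICAL:" bad
            else if (rc == 1) print "WARNING:" bad
            else              print "OK: no significant disk errors found"
            exit rc
        }
    '
}
```

Usage would be something like "kstat -pm sderr | sderr_check 5 20". Note that
the kstat names are sdN instance names rather than the cXtYdZ names that
zpool status prints.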
 -- richard

> 
> Perhaps it's of use for others as well:
> 
> ###################################################################
> #!/bin/bash
> # check disk in all pools for errors.
> # partially failing (or slow) disks
> # may result in horribly degraded
> # performance of zpools despite the fact
> # the pool is still healthy
> 
> # exit codes
> # 0 OK
> # 1 WARNING
> # 2 CRITICAL
> # 3 UNKNOWN
> 
> OUTPUT=""
> WARNING="0"
> CRITICAL="0"
> SOFTLIMIT="5"
> HARDLIMIT="20"
> 
> LIST=$(zpool status | grep "c[0-9].*d[0-9] " | awk '{print $1}')
>    for DISK in $LIST 
>    do  
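>        # total error count is the 4th comma-separated field of "iostat -enr"
>        ERROR=$(iostat -enr "$DISK" | cut -d "," -f 4 | grep "^[0-9]")
>        # check the hard limit first so a disk is only reported once
>        if [[ $ERROR -gt $HARDLIMIT ]]
>        then
>            OUTPUT="$OUTPUT, $DISK:$ERROR"
>            CRITICAL="1"
>        elif [[ $ERROR -gt $SOFTLIMIT ]]
>        then
>            OUTPUT="$OUTPUT, $DISK:$ERROR"
>            WARNING="1"
>        fi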
>    done
> 
> if [[ $CRITICAL -gt 0 ]]
> then
>    echo "CRITICAL: Disks with error count > $HARDLIMIT found: $OUTPUT"
>    exit 2
> fi
> if [[ $WARNING -gt 0 ]]
> then
>    echo "WARNING: Disks with error count > $SOFTLIMIT found: $OUTPUT"
>    exit 1
> fi
> 
> echo "OK: No significant disk errors found"
> exit 0
> 
> ###########################################################################################
> 
> 
> 
> cu
> 
> Carsten
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422








