Hello everybody,

I just wanted to share my experience with a (partially) broken SSD that was in 
use in a ZIL mirror.

We experienced a dramatic performance problem with one of our zpools, serving 
home directories. Mainly NFS clients were affected. Our SunRay infrastructure 
came to a complete halt.

Eventually we identified one SSD as the root cause. The SSD was still 
working, but very slowly.

The issue did not cause ZFS to mark the disk as faulty, and FMA did not 
detect it either.

We identified the broken disk by running "iostat -en". After replacing the SSD, 
everything went back to normal.
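
For illustration, the manual check can be sketched as a small shell filter. 
This is only a sketch: the function name disks_over_limit and the threshold 
are made up for this example, and it assumes the usual "iostat -en" layout of 
four error columns (s/w, h/w, trn, tot) followed by the device name.

```shell
# Read "iostat -en"-style output on stdin and print "device total" for
# every disk whose total error count (4th column) is above the limit.
# disks_over_limit is a hypothetical helper, not a standard tool.
disks_over_limit() {
    limit="$1"
    while read -r sw hw trn tot dev; do
        case "$sw" in
            ''|*[!0-9]*) continue ;;   # skip header and blank lines
        esac
        if [ "$tot" -gt "$limit" ]; then
            echo "$dev $tot"
        fi
    done
    return 0
}
```

On a live system it would be fed as "iostat -en | disks_over_limit 5".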

To prevent outages like this in the future I hacked together a "quick and 
dirty" bash script to detect disks whose total error count exceeds a given 
threshold. The script can be used together with Nagios.

Perhaps it's of use to others as well:

###################################################################
#!/bin/bash
# Check the disks in all pools for errors.
# Partially failing (or slow) disks
# may result in horribly degraded
# performance of zpools even though
# the pool is still reported as healthy.

# exit codes
# 0 OK
# 1 WARNING
# 2 CRITICAL
# 3 UNKNOWN

OUTPUT=""
WARNING="0"
CRITICAL="0"
SOFTLIMIT="5"
HARDLIMIT="20"

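# Collect the pool devices from zpool status. Note that this pattern
# only matches names on controllers c1 to c9 that end in "d0".
LIST=$(zpool status | grep "c[1-9].*d0 " | awk '{print $1}')
for DISK in $LIST
do
    # Field 4 of the comma-separated "iostat -enr" output is the total error count.
    ERROR=$(iostat -enr "$DISK" | cut -d "," -f 4 | grep "^[0-9]")
    # Check the hard limit first so each disk is reported only once.
    if [[ $ERROR -gt $HARDLIMIT ]]
    then
        OUTPUT="${OUTPUT:+$OUTPUT, }$DISK:$ERROR"
        CRITICAL="1"
    elif [[ $ERROR -gt $SOFTLIMIT ]]
    then
        OUTPUT="${OUTPUT:+$OUTPUT, }$DISK:$ERROR"
        WARNING="1"
    fi
done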

if [[ $CRITICAL -gt 0 ]]
then
    echo "CRITICAL: Disks with error count > $HARDLIMIT found: $OUTPUT"
    exit 2
fi
if [[ $WARNING -gt 0 ]]
then
    echo "WARNING: Disks with error count > $SOFTLIMIT found: $OUTPUT"
    exit 1
fi

echo "OK: No significant disk errors found"
exit 0

###########################################################################################
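
The header of the script lists exit code 3 (UNKNOWN), but the script never 
returns it. One possible way to use it, shown here only as a sketch, is to 
check that the required commands exist before doing anything else. The 
function name check_tools is an assumption for this example and not part of 
the script above.

```shell
# Return 0 if every named command is found on PATH; otherwise print a
# Nagios-style UNKNOWN line and return 3.
# check_tools is a hypothetical helper for illustration.
check_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "UNKNOWN: required command '$tool' not found"
            return 3
        fi
    done
    return 0
}
```

Near the top of the script it could be called as 
"check_tools zpool iostat || exit $?".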



cu

Carsten
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
