My setup: a SuperMicro 24-drive chassis with a dual-processor Intel motherboard, three LSI SAS3081E controllers, and 24 2TB SATA drives, divided into three pools, each pool a single eight-disk RAID-Z2. (Boot is an SSD connected to motherboard SATA.)

This morning I got a cheerful email from my monitoring script: "Zchecker has discovered a problem on bigdawg." The full zpool status output is at the end of this message; the short version is one unavailable pool and two degraded pools, with every problem disk attached to controller c10. I have multiple spare controllers available.

First question: is there an easy way to identify which physical card is c10?

Second question: what is the best way to handle the replacement (of either the bad controller, or of all three controllers if I can't identify the bad one)? My thinking: shut the server down, remove the controller(s), install the replacement(s), check that all the drives are visible, run 'zpool clear' for each pool, and then run another scrub to verify the problem has been resolved. Does that sound like a good plan? I've sketched a first stab at both questions below.
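For the first question, here's what I'd try (an untested sketch; the link targets are placeholders for whatever this box actually shows, not real output):

    # Follow one of the failing disks back to its physical PCI path:
    ls -l /dev/dsk/c10t0d0s0
    # or go straight to the controller's attachment point:
    ls -l /dev/cfg/c10
    # Either link resolves to a path under /devices along the lines of
    #   ../../devices/pci@0,0/.../scsi@0:scsi
    # and the pci@ segments identify the slot holding the card.

    # Cross-reference against the mpt driver instances and, where the
    # platform reports it, the physical slot labels:
    grep mpt /etc/path_to_inst
    prtdiag -v | less

And for the second question, roughly this sequence after the hardware swap (again just a sketch; pool names are from the status output below):

    # Confirm all 24 drives are visible again; feeding format an EOF
    # makes it print the disk list and exit without prompting:
    format < /dev/null

    # Clear the accumulated error states and counters on each pool:
    zpool clear uberdisk1
    zpool clear uberdisk2
    zpool clear uberdisk3

    # Scrub again to verify, then watch for fresh errors:
    zpool scrub uberdisk1
    zpool scrub uberdisk2
    zpool scrub uberdisk3
    zpool status -v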
===

  pool: uberdisk1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 3h7m, 24.08% done, 9h52m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk1    UNAVAIL     55     0     0  insufficient replicas
          raidz2     UNAVAIL    112     0     0  insufficient replicas
            c9t0d0   ONLINE       0     0     0
            c9t1d0   ONLINE       0     0     0
            c9t2d0   ONLINE       0     0     0
            c10t0d0  UNAVAIL     43    30     0  experienced I/O failures
            c10t1d0  REMOVED      0     0     0
            c10t2d0  ONLINE      74     0     0
            c11t1d0  ONLINE       0     0     0
            c11t2d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: uberdisk2
 state: DEGRADED
 scrub: scrub in progress for 3h3m, 32.26% done, 6h24m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk2    DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c9t3d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  REMOVED      0     0     0
            c11t3d0  ONLINE       0     0     0
            c11t4d0  ONLINE       0     0     0
            c11t5d0  ONLINE       0     0     0

errors: No known data errors

  pool: uberdisk3
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 2h58m, 31.95% done, 6h19m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk3    DEGRADED     1     0     0
          raidz2     DEGRADED     4     0     0
            c9t6d0   ONLINE       0     0     0
            c9t7d0   ONLINE       0     0     0
            c10t5d0  ONLINE       5     0     0
            c10t6d0  ONLINE      98    94     0
            c10t7d0  REMOVED      0     0     0
            c11t6d0  ONLINE       0     0     0
            c11t7d0  ONLINE       0     0     0
            c11t8d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com