My setup: a SuperMicro 24-drive chassis with a dual-processor Intel motherboard, three LSI SAS3081E controllers, and 24 2TB SATA drives, divided into three pools, each pool a single eight-disk RAID-Z2. (Boot is an SSD connected to motherboard SATA.)

This morning I got a cheerful email from my monitoring script: "Zchecker has discovered a problem on bigdawg." The full zpool status output is at the end of this message; the short version is one unavailable pool and two degraded pools, with every problem disk attached to controller c10. I have multiple spare controllers available.

First question: is there an easy way to identify which physical card is c10?

Second question: what is the best way to handle the replacement (of either the bad controller, or of all three controllers if I can't identify the bad one)? My thinking: shut the server down, remove the controller(s), install the replacement(s), check that all the drives are visible, run 'zpool clear' for each pool, and then run another scrub to verify the problem has been resolved. Does that sound like a good plan? I've sketched a first stab at both questions below.
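For the first question, here's what I'd try (an untested sketch; the link targets are placeholders for whatever this box actually shows, not real output):

    # Follow one of the failing disks back to its physical PCI path:
    ls -l /dev/dsk/c10t0d0s0
    # or go straight to the controller's attachment point:
    ls -l /dev/cfg/c10
    # Either link resolves to a path under /devices along the lines of
    #   ../../devices/pci@0,0/.../scsi@0:scsi
    # and the pci@ segments identify the slot holding the card.

    # Cross-reference against the mpt driver instances and, where the
    # platform reports it, the physical slot labels:
    grep mpt /etc/path_to_inst
    prtdiag -v | less

And for the second question, roughly this sequence after the hardware swap (again just a sketch; pool names are from the status output below):

    # Confirm all 24 drives are visible again; feeding format an EOF
    # makes it print the disk list and exit without prompting:
    format < /dev/null

    # Clear the accumulated error states and counters on each pool:
    zpool clear uberdisk1
    zpool clear uberdisk2
    zpool clear uberdisk3

    # Scrub again to verify, then watch for fresh errors:
    zpool scrub uberdisk1
    zpool scrub uberdisk2
    zpool scrub uberdisk3
    zpool status -v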
===

  pool: uberdisk1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 3h7m, 24.08% done, 9h52m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk1    UNAVAIL     55     0     0  insufficient replicas
          raidz2     UNAVAIL    112     0     0  insufficient replicas
            c9t0d0   ONLINE       0     0     0
            c9t1d0   ONLINE       0     0     0
            c9t2d0   ONLINE       0     0     0
            c10t0d0  UNAVAIL     43    30     0  experienced I/O failures
            c10t1d0  REMOVED      0     0     0
            c10t2d0  ONLINE      74     0     0
            c11t1d0  ONLINE       0     0     0
            c11t2d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: uberdisk2
 state: DEGRADED
 scrub: scrub in progress for 3h3m, 32.26% done, 6h24m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk2    DEGRADED     0     0     0
          raidz2     DEGRADED     0     0     0
            c9t3d0   ONLINE       0     0     0
            c9t4d0   ONLINE       0     0     0
            c9t5d0   ONLINE       0     0     0
            c10t3d0  REMOVED      0     0     0
            c10t4d0  REMOVED      0     0     0
            c11t3d0  ONLINE       0     0     0
            c11t4d0  ONLINE       0     0     0
            c11t5d0  ONLINE       0     0     0

errors: No known data errors

  pool: uberdisk3
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 2h58m, 31.95% done, 6h19m to go
config:

        NAME         STATE     READ WRITE CKSUM
        uberdisk3    DEGRADED     1     0     0
          raidz2     DEGRADED     4     0     0
            c9t6d0   ONLINE       0     0     0
            c9t7d0   ONLINE       0     0     0
            c10t5d0  ONLINE       5     0     0
            c10t6d0  ONLINE      98    94     0
            c10t7d0  REMOVED      0     0     0
            c11t6d0  ONLINE       0     0     0
            c11t7d0  ONLINE       0     0     0
            c11t8d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

--
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com