Wow, sounds familiar -- been there, done that. I thought it was just when using expanders... guess it's anything 1068-based. I lost a 20TB pool when the controller basically hosed up what it was doing and wrote scragged data to the disks.
1) The suggestion to use the drive serial numbers to trace back what's connected to what is good, assuming you can pull drives to look at their serial numbers.

2) One thing I've done over the years, given that I often use the same motherboards, is physically map out the PCI slot addresses:

/dev/cfg/c2  ../../devices/p...@0,0/pci8086,3...@3/pci1000,3...@0:scsi
/dev/cfg/c3  ../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c4  ../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c5  ../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c6  ../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c7  ../../devices/p...@7a,0/pci8086,3...@9/pci1000,3...@0:scsi

In those paths: the pci8086,3...@N component corresponds with a physical slot; the p...@0,0 vs. p...@7a,0 prefixes, if you have a SM dual-IOH board, represent the two IOH-36s; and on a single-IOH board, I've noted that the @N (it's "unit-address" from DDI) often corresponds to the physical slot number.

So far, figuring this out has involved "stick a card in a slot / reboot / reconfig / see what address it's at / note it down" or other forms of reverse engineering. Handy to have occasionally. If you're doing a BYO build, taking the time up front to figure this out is a Good Idea.

3) Get a copy of "lsiutil" for Solaris (available from LSI's site) -- it's an easy way to check out the controller and see whether it's there and whether it sees the drives. (There is a newer version of lsiutil that supports the 2008s... strangely, it's not available from the LSI site. Their tech support didn't even know it existed when I asked. I got my copy off someone on hardforum.)

4) Things you didn't want to know: the LSI 1068 actually has a very small write cache on board.
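If it helps, the controller-to-slot mapping in 2) can be pulled out of the /dev/cfg listing with a one-off awk sketch like the one below. The sample input mimics my listing above (paths truncated the same way); on a live box you'd feed it the output of `ls -l /dev/cfg/c*` instead. The function name is just something I made up for illustration.

```shell
# Sketch: pick out the host-bridge child (the pci8086,...@N component whose
# unit-address tracks the physical slot) from cfgadm-style symlink targets.
map_slots() {
  awk '{
    n = split($2, parts, "/")
    # walk the path components; the first pci8086 node is the slot-bearing one
    for (i = 1; i <= n; i++)
      if (parts[i] ~ /^pci8086/) { print $1, parts[i]; break }
  }'
}

# Sample data copied from the listing above; a real run would use:
#   ls -l /dev/cfg/c* | awk '{ print $NF, $(NF-2) }'-style plumbing instead.
map_slots <<'EOF'
/dev/cfg/c2 ../../devices/p...@0,0/pci8086,3...@3/pci1000,3...@0:scsi
/dev/cfg/c5 ../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@f:scsi
EOF
```

That prints each controller next to its slot-identifying node (c2 -> pci8086,3...@3, c5 -> pci8086,3...@5), which is enough to build the slot table once you've done the stick-a-card-in-and-reboot survey.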
So if you hit a certain set of circumstances (namely, setting the "device i/o timeout" in the BIOS to something other than 0, then having a SATA drive blow up in a certain way such that it hangs for longer than the timeout you set), the mpt driver, it seems, can get impatient and re-initialize the controller -- or that's what it looks like. Great way to scrag a volume. :(

5) Your basic plan seems sound.

> Message: 1
> Date: Sat, 06 Nov 2010 13:27:08 -0500
> From: Dave Pooser <dave....@alfordmedia.com>
> To: <zfs-discuss@opensolaris.org>
> Subject: [zfs-discuss] Apparent SAS HBA failure-- now what?
> Message-ID: <c8fb082c.34460%dave....@alfordmedia.com>
> Content-Type: text/plain; charset="US-ASCII"
>
> First question-- is there an easy way to identify which controller is c10?
> Second question-- What is the best way to handle replacement (of either the
> bad controller or of all three controllers if I can't identify the bad
> controller)? I was thinking that I should be able to shut the server down,
> remove the controller(s), install the replacement controller(s), check to
> see that all the drives are visible, run zpool clear for each pool and then
> do another scrub to verify the problem has been resolved. Does that sound
> like a good plan?

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
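P.S. For the swap-and-verify plan in the quoted message, here's the post-replacement sequence I'd run, wrapped in a dry-run echo so you can eyeball it first. The pool names are placeholders and the DRY_RUN wrapper is my own scaffolding; the zpool commands themselves (status/clear/scrub) are the standard ones.

```shell
# Hedged sketch: after swapping the HBA(s), confirm the drives are visible,
# clear the error counters, then scrub to re-verify checksums end to end.
# DRY_RUN=1 (the default) only echoes the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

for pool in tank1 tank2; do        # placeholder pool names
  run zpool status -v "$pool"      # every vdev should show ONLINE again
  run zpool clear "$pool"          # clear errors left over from the bad HBA
  run zpool scrub "$pool"          # full re-read to confirm nothing scragged
done
```

Set DRY_RUN=0 once the echoed commands look right; watch `zpool status` afterwards until the scrubs finish clean.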