Wow, sounds familiar - been there, done that. I thought it only happened
when using expanders... guess it's anything 1068-based. I lost a 20TB
pool to the controller basically hosing up whatever it was doing and
writing scragged data to disk.

1) The suggestion to use the drive serial numbers to trace back what's
connected to what is a good one, assuming you can pull drives to look
at their serial numbers.

2) One thing I've done over the years, given that I often use the
same motherboards, is physically map out the PCI slot addresses -

/dev/cfg/c2  ../../devices/p...@0,0/pci8086,3...@3/pci1000,3...@0:scsi
/dev/cfg/c3  ../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c4  ../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c5  ../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c6  ../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c7  ../../devices/p...@7a,0/pci8086,3...@9/pci1000,3...@0:scsi

Reading those paths (taking the c7 line as the example):

 - the pci8086,3...@9 segment is the part that will correspond with a
   physical slot
 - the p...@0,0 vs. p...@7a,0 segment: if you have a SM dual-IOH board,
   these represent the two IOH-36s
 - the unit-address on that pci8086 segment (the @9 here): on a
   single-IOH board, I've noted that this often corresponds to the
   physical slot number (it's "unit-address" from DDI)

So far, it's involved "stick a card in a slot/reboot/reconfig/see what
address it's at/note it down" or other forms of reverse engineering.
Handy to have occasionally. If you're doing a BYO, taking the time up
front to figure this out is a Good Idea. 
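For reference, the listing above is essentially trimmed ls -l /dev/cfg
output; a rough sketch of the poking-around involved when mapping a new
board (nothing exotic, and prtconf output varies a lot by platform):

  # ls -l /dev/cfg        <- controller number -> physical device path
  # cfgadm -al            <- what's attached under each cN right now
  # prtconf -pv | more    <- walk the device tree and properties if you
                             want to chase the unit-addresses further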

3) Get a copy of "lsiutil" for Solaris (available from LSI's site) -
it's an easy way to check out the controller and see whether it's there
and whether it sees the drives.

(There is a newer version of lsiutil that supports the 2008s...
strangely, it's not available from the LSI site. Their tech support
didn't even know it existed when I asked. I got my copy off someone on
hardforum.) 
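Basic use is dead simple - run it, pick the controller from the list it
prints, and poke through the menu (the menu numbers move around between
versions, so treat this as a sketch rather than a recipe):

  # lsiutil               <- enumerates the MPT-based ports it finds,
                             then drops you into an interactive menu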

4) Things you didn't want to know: the LSI1068 actually has a very small
write cache on board. So if you hit a particular combination of
circumstances (namely, setting the "device i/o timeout" in the BIOS to
something other than 0, then having a SATA drive blow up in a way that
makes it hang for longer than that timeout), the mpt driver can
apparently get impatient and re-initialize the controller - or at least
that's what it looks like from here. Great way to scrag a volume. :(
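If you think you've hit that, the telltale is usually in the system log
- a rough sketch (the exact message text depends on the driver version):

  # grep -i mpt /var/adm/messages*    <- look for reset/reinit chatter
                                         around the time things went south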
 
5) Your basic plan seems sound.
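i.e., after swapping the hardware, something along these lines - "tank"
is just a placeholder pool name:

  # zpool status -x       <- confirm all pools see all their devices again
  # zpool clear tank      <- clear the accumulated error counts
  # zpool scrub tank      <- then watch it with "zpool status tank"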


> Message: 1
> Date: Sat, 06 Nov 2010 13:27:08 -0500
> From: Dave Pooser <dave....@alfordmedia.com>
> To: <zfs-discuss@opensolaris.org>
> Subject: [zfs-discuss] Apparent SAS HBA failure-- now what?
> Message-ID: <c8fb082c.34460%dave....@alfordmedia.com>
> Content-Type: text/plain;     charset="US-ASCII"
> 
> First question-- is there an easy way to identify which controller is c10?
> Second question-- What is the best way to handle replacement (of either the
> bad controller or of all three controllers if I can't identify the bad
> controller)? I was thinking that I should be able to shut the server down,
> remove the controller(s), install the replacement controller(s), check to
> see that all the drives are visible, run zpool clear for each pool and then
> do another scrub to verify the problem has been resolved. Does that sound
> like a good plan?

