Hi Sanjeev

OK - I had a chance to do more testing over the weekend. First, some extra data:

With the mirror moved so that both drives were on ICH10R ports, a sudden disk 
power-off caused the mirror to fault cleanly over to the remaining drive, no problem.

With a single-drive pool on the ICH10R under heavy write traffic, powering the 
disk off causes the zpool/zfs hangs described above.

The zpool being tested is called "Remove" and consists of:
c7t2d0s0 - attached to the ICH10R
c8t0d0s0 - second disk attached to the Si3132 card with the Si3124 driver
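For reference, a mirror across those two devices would typically have been set up with something like the following - a reconstruction for clarity, not the exact commands from my setup:

```shell
# Reconstruction (hypothetical): create a two-way mirror named "Remove"
# from one ICH10R disk and one Si3132 disk.
zpool create Remove mirror c7t2d0s0 c8t0d0s0
zpool status Remove   # confirm both sides of the mirror show ONLINE
```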

This leads me to the following suspicions:
(1) There is an si3124 driver issue: it does not always detect the drive 
removal, or it fails to pass that information back to ZFS, even though we know 
the kernel noticed
(2) If the only disk in a pool goes faulted, the zpool/zfs subsystem blocks 
indefinitely trying to flush the pending writes.

I've just recabled back to one disk on the ICH10R and one on the Si3132 and 
tried the sudden power-off with the Si drive:

*) First try - mirror faulted and IO continued - good news but confusing
*) Second try - zfs/zpool hung, couldn't even get a zpool status; tried a 
savecore but savecore hung moving the data to a separate zpool
*) Third try - zfs/zpool hung, ran savecore -L to a UFS filesystem I created 
for that purpose
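For anyone who wants to grab a live dump the same way when zfs/zpool wedges, the sequence I used was roughly the following (the device name and mount point are placeholders - substitute a spare slice on your own system):

```shell
# Hypothetical device/mount names - any spare non-ZFS slice will do.
newfs /dev/rdsk/c7t3d0s0         # make a UFS filesystem on a spare slice
mkdir -p /crash
mount /dev/dsk/c7t3d0s0 /crash   # mount it somewhere savecore can write
savecore -L /crash               # live crash dump of the running kernel
```

The point of UFS here is that it doesn't depend on the wedged ZFS subsystem, so savecore can still complete.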

After the first try, dmesg shows:
Aug 10 00:34:41 TS1  SATA device detected at port 0
Aug 10 00:34:41 TS1 sata: [ID 663010 kern.info] 
/p...@0,0/pci8086,3...@1c,3/pci1095,7...@0 :
Aug 10 00:34:41 TS1 sata: [ID 761595 kern.info]         SATA disk device at 
port 0
Aug 10 00:34:41 TS1 sata: [ID 846691 kern.info]         model WDC 
WD5000AACS-00ZUB0
Aug 10 00:34:41 TS1 sata: [ID 693010 kern.info]         firmware 01.01B01
Aug 10 00:34:41 TS1 sata: [ID 163988 kern.info]         serial number      
WD-xxxxxxxxxxxxxx
Aug 10 00:34:41 TS1 sata: [ID 594940 kern.info]         supported features:
Aug 10 00:34:41 TS1 sata: [ID 981177 kern.info]          48-bit LBA, DMA, 
Native Command Queueing, SMART, SMART self-test
Aug 10 00:34:41 TS1 sata: [ID 643337 kern.info]         SATA Gen2 signaling 
speed (3.0Gbps)
Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info]         Supported queue depth 
32, limited to 31
Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info]         capacity = 976773168 
sectors
Aug 10 00:34:41 TS1 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, 
TYPE: Fault, VER: 1, SEVERITY: Major
Aug 10 00:34:41 TS1 EVENT-TIME: Mon Aug 10 00:34:41 BST 2009
Aug 10 00:34:41 TS1 PLATFORM:                                  , CSN:           
                       , HOSTNAME: TS1
Aug 10 00:34:41 TS1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 10 00:34:41 TS1 EVENT-ID: ab7df266-3380-4a35-e0bc-9056878fd182
Aug 10 00:34:41 TS1 DESC: The number of I/O errors associated with a ZFS device 
exceeded
Aug 10 00:34:41 TS1          acceptable levels.  Refer to 
http://sun.com/msg/ZFS-8000-FD for more information.
Aug 10 00:34:41 TS1 AUTO-RESPONSE: The device has been offlined and marked as 
faulted.  An attempt
Aug 10 00:34:41 TS1          will be made to activate a hot spare if available.
Aug 10 00:34:41 TS1 IMPACT: Fault tolerance of the pool may be compromised.
Aug 10 00:34:41 TS1 REC-ACTION: Run 'zpool status -x' and replace the bad 
device.

and after the second and third test, just:
SATA device detached at port 0

The core files were tarred together, compressed with bzip2, and can be found at:

http://dl.getdropbox.com/u/1709454/dump.bakerci.200908100106.tar.bz2

Please let me know if you need any further cores or debug output. Apologies to 
readers having all this inflicted on them by email digest.

Many thanks

Chris
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
