Hi Sanjeev,

OK - had a chance to do more testing over the weekend. Firstly, some extra data:
Moving the mirror to both drives on ICH10R ports: on sudden disk power-off, the mirror faulted cleanly to the remaining drive, no problem. A one-drive pool on the ICH10R under heavy write traffic, then powered off, causes the zpool/zfs hangs described above.

The zpool being tested is called "Remove" and consists of:

  c7t2d0s0 - attached to the ICH10R
  c8t0d0s0 - second disk, attached to the Si3132 card with the Si3124 driver

This leads me to the following suspicions:

(1) We have an Si3124 issue: it does not always detect the drive removal, or it fails to pass that information back to ZFS, even though we know the kernel noticed.
(2) When the only disk in a pool goes faulted, the zpool/zfs subsystem blocks indefinitely waiting to get rid of the pending writes.

I've just recabled back to one disk on the ICH10R and one on the Si3132 and tried the sudden power-off with the Si drive:

*) First try - mirror faulted and I/O continued - good news, but confusing
*) Second try - zfs/zpool hung; couldn't even get a zpool status. Tried a savecore, but savecore hung moving the data to a separate zpool
*) Third try - zfs/zpool hung; ran savecore -L to a UFS filesystem I created for that purpose

After the first try, dmesg shows:

Aug 10 00:34:41 TS1 SATA device detected at port 0
Aug 10 00:34:41 TS1 sata: [ID 663010 kern.info] /p...@0,0/pci8086,3...@1c,3/pci1095,7...@0 :
Aug 10 00:34:41 TS1 sata: [ID 761595 kern.info] SATA disk device at port 0
Aug 10 00:34:41 TS1 sata: [ID 846691 kern.info] model WDC WD5000AACS-00ZUB0
Aug 10 00:34:41 TS1 sata: [ID 693010 kern.info] firmware 01.01B01
Aug 10 00:34:41 TS1 sata: [ID 163988 kern.info] serial number WD-xxxxxxxxxxxxxx
Aug 10 00:34:41 TS1 sata: [ID 594940 kern.info] supported features:
Aug 10 00:34:41 TS1 sata: [ID 981177 kern.info] 48-bit LBA, DMA, Native Command Queueing, SMART, SMART self-test
Aug 10 00:34:41 TS1 sata: [ID 643337 kern.info] SATA Gen2 signaling speed (3.0Gbps)
Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] Supported queue
depth 32, limited to 31
Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] capacity = 976773168 sectors
Aug 10 00:34:41 TS1 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 10 00:34:41 TS1 EVENT-TIME: Mon Aug 10 00:34:41 BST 2009
Aug 10 00:34:41 TS1 PLATFORM: , CSN: , HOSTNAME: TS1
Aug 10 00:34:41 TS1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 10 00:34:41 TS1 EVENT-ID: ab7df266-3380-4a35-e0bc-9056878fd182
Aug 10 00:34:41 TS1 DESC: The number of I/O errors associated with a ZFS device exceeded
Aug 10 00:34:41 TS1 acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Aug 10 00:34:41 TS1 AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Aug 10 00:34:41 TS1 will be made to activate a hot spare if available.
Aug 10 00:34:41 TS1 IMPACT: Fault tolerance of the pool may be compromised.
Aug 10 00:34:41 TS1 REC-ACTION: Run 'zpool status -x' and replace the bad device.

After the second and third tests, just:

SATA device detached at port 0

Core files were tar-ed together and bzip2-ed and can be found at:

http://dl.getdropbox.com/u/1709454/dump.bakerci.200908100106.tar.bz2

Please let me know if you need any further cores/debug info. Apologies to readers having all this inflicted on them by email digest.

Many thanks

Chris
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss