> This message is from the disk saying that it aborted a command. These
> are usually preceded by a reset, as shown here. What caused the reset
> condition? Was it actually target 11 or did target 11 get caught up in
> the reset storm?
> 

It happened in the middle of the night and nobody touched the file server.
I assume this is the transitional state before the disk is *thoroughly* damaged:

Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: cn03
Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a ZFS device exceeded
Jun 10 09:34:11 cn03         acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Jun 10 09:34:11 cn03         will be made to activate a hot spare if available.
Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be compromised.
Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the bad device.
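
For reference, I understand the suggested recovery to be roughly the
following, with 'tank' and the disk names as placeholders for the real
pool and devices:

  # list only pools with errors or faulted devices
  zpool status -x
  # swap the faulted disk for a replacement
  zpool replace tank <faulted-disk> <new-disk>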

After I rebooted it, I got:
Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] SunOS Release 5.11 Version snv_134 64-bit
Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright 1983-2010 Sun Microsystems, Inc.  All rights reserved.
Jun 10 11:38:49 cn03 Use is subject to license terms.
Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features: 7f7fffff<sse4_2,sse4_1,ssse3,cpuid,mwait,tscp,cmp,cx16,sse3,nx,asysc,htt,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>

Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
Jun 10 11:39:06 cn03    mptsas0 unrecognized capability 0x3

Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:42 cn03    drive offline
Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:47 cn03    drive offline
Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:52 cn03    drive offline
Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
Jun 10 11:39:57 cn03    drive offline
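
To check whether the errors are confined to sd3 or spread across the
whole bus, I suppose the per-device error counters and the FMA error
log would show it, something like:

  # soft/hard/transport error counts per device
  iostat -En
  # raw FMA error telemetry (resets, aborted commands, etc.)
  fmdump -eV | less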


> 
> Hot spare will not help you here. The problem is not constrained to one
> disk. In fact, a hot spare may be the worst thing here because it can
> kick in for the disk complaining about a clogged expander or spurious
> resets. This causes a resilver that reads from the actual broken disk,
> that causes more resets, that kicks out another disk that causes a
> resilver, and so on.
>  -- richard
> 

So would warm spares be a "better" choice in this situation?
BTW, under what conditions does a SCSI reset storm happen?
How can we protect against it so that the file service is not interrupted?
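
By a warm spare I mean a disk that is cabled and powered but *not*
added to the pool as a spare, so nothing resilvers automatically; only
after we confirm which disk is really bad would we attach it by hand,
roughly:

  # nothing configured beforehand; after diagnosis:
  zpool replace tank <faulted-disk> <warm-spare-disk>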


Thanks.
Fred
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
