On Tue, 14 Aug 2007, Richard Elling wrote:

> Rick Wager wrote:
>> We see similar problems on a SuperMicro with five 500 GB Seagate SATA drives. 
>> This is using the AHCI driver. We do not, however, see problems with the 
>> same hardware/drivers if we use 250GB drives.
>
> Duh.  The error is from the disk :-)

A likely possibility is that the disk drives are simply not getting 
enough (cool) airflow and are overheating during periods of high 
system activity that generate a lot of disk head movement; for 
example, during a zpool scrub.  The extra platters present in the 
larger disk drives would require even more cooling capacity, which 
would be consistent with your observations.
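
One cheap way to check the heat theory from the logs alone is to see 
whether the resets and retryable errors cluster around periods of heavy 
activity such as a scrub.  The following is only a rough sketch, not a 
tested tool: the function name events_per_hour is my own invention, and 
it assumes the standard syslog line layout shown in the sample further 
down (timestamp, hostname, message) with each message on a single 
unwrapped line.

```python
import re
from collections import Counter

# Assumed line layout: "Aug 14 11:20:28 chazz1 <message>".
# Group 1 keeps "month day hour"; group 2 is the message text.
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s+\S+\s+(.*)$")


def events_per_hour(lines):
    """Count port resets and retryable errors per 'month day hour' bucket."""
    counts = Counter()
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # not a syslog-style line; ignore it
        hour, message = m.groups()
        if "device reset" in message or "Retryable" in message:
            counts[hour] += 1
    return counts
```

Feed it the lines of the system log (on Solaris, normally 
/var/adm/messages) and print the resulting buckets in order.  If the 
counts spike in the hours when a scrub or other heavy I/O was running, 
that supports the cooling explanation; if they are spread evenly, look 
elsewhere.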

It is best to actually *measure* the effectiveness of the disk cooling 
design/installation.  I recommend looking at the Fluke mini infrared 
thermometers, for example the Fluke 62: 
http://www.testequipmentdepot.com/fluke/thermometers/62.htm

In some disk drive installations, it's possible for the infrared probe 
to "see" the disk HDA (Head Disk Assembly) without disturbing the 
drive.

PS: I use a much older Fluke 80T-IR in combination with a digital 
multimeter with millivolt resolution (a Fluke meter of course!).

>> We sometimes see bad blocks reported (are these automatically remapped 
>> somehow so they are not used again?) and sometimes sata port resets.
>
> Depending on how the errors are reported, the driver may attempt a reset
> to clear.  The drive may also automatically spare bad blocks.
>
>> Here is a sample of the log output. Any help understanding and/or resolving 
>> this issue greatly appreciated. I very much don't want to have freezes in 
>> production.
>>
>> Aug 14 11:20:28 chazz1  port 2: device reset
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],2/[EMAIL PROTECTED],0 (sd3):
>> Aug 14 11:20:28 chazz1  Error for Command: write                   Error Level: Retryable
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    Requested Block: 530                       Error Block: 530
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    Sense Key: No_Additional_Sense
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>
> This error was transient and retried.  If it was a fatal error (still
> failed after retries) then you'll have another, different message
> describing the failed condition.
>  -- richard
>

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
            Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/