On Tue, 14 Aug 2007, Richard Elling wrote:

> Rick Wager wrote:
>> We see similar problems on a SuperMicro with 5 500 GB Seagate sata drives.
>> This is using the AHCI driver. We do not, however, see problems with the
>> same hardware/drivers if we use 250GB drives.
>
> Duh. The error is from the disk :-)

A likely possibility is that the disk drives are simply not getting enough
(cool) airflow and are overheating during periods of high system activity
that generates a lot of disk head movement, for example during a zpool
scrub. The extra platters present in the larger disk drives would require
even more cooling capacity, which would be consistent with your
observations.

Best to actually *measure* the effectiveness of the disk cooling
design/installation. Recommendation: investigate the Fluke mini infrared
thermometers, for example the Fluke 62 at:

http://www.testequipmentdepot.com/fluke/thermometers/62.htm

In some disk drive installations, it's possible for the infrared probe to
"see" the disk HDA (Head Disk Assembly) without disturbing the drive.

PS: I use a much older Fluke 80T-IR in combination with a digital
multimeter with millivolt resolution (a Fluke meter of course!).

>> We sometimes see bad blocks reported (are these automatically remapped
>> somehow so they are not used again?) and sometimes sata port resets.
>
> Depending on how the errors are reported, the driver may attempt a reset
> to clear. The drive may also automatically spare bad blocks.
>
>> Here is a sample of the log output. Any help understanding and/or resolving
>> this issue greatly appreciated. I very much don't want to have freezes in
>> production.
>>
>> Aug 14 11:20:28 chazz1 port 2: device reset
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],2/[EMAIL PROTECTED],0 (sd3):
>> Aug 14 11:20:28 chazz1 Error for Command: write    Error Level: Retryable
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice] Requested Block: 530    Error Block: 530
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice] Vendor: ATA    Serial Number:
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice] Sense Key: No_Additional_Sense
>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>
> This error was transient and retried. If it was a fatal error (still
> failed after retries) then you'll have another, different message
> describing the failed condition.
> -- richard

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
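
To tell transient errors from fatal ones across a longer log, the "Error Level" field in messages like the one quoted in this thread can be tallied with a short script. The following is only a minimal sketch: the function name count_error_levels and the SAMPLE list are made up for illustration, and the pattern assumes the field is formatted exactly as in the quoted excerpt, which may not hold for other drivers or releases.

```python
import re
from collections import Counter

# Assumption: the field appears as "Error Level: <word>", as in the excerpt
# quoted in this thread. Other drivers or releases may format it differently.
ERROR_LEVEL = re.compile(r"Error\s+Level:\s*(\w+)")


def count_error_levels(lines):
    """Count how often each error level appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        # Drop any leading mail quote markers ("> ", ">> ") before matching.
        text = line.lstrip("> ")
        match = ERROR_LEVEL.search(text)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative input: two lines copied from the quoted excerpt.
SAMPLE = [
    ">> Aug 14 11:20:28 chazz1 port 2: device reset",
    ">> Aug 14 11:20:28 chazz1 Error for Command: write    Error Level: Retryable",
]

if __name__ == "__main__":
    for level, n in count_error_levels(SAMPLE).items():
        print(level, n)
```

A growing count of retryable errors on one device, or any level other than Retryable, would be the cue to look more closely at that drive and its cooling.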