Re: [zfs-discuss] scrub halts

Richard Elling Tue, 14 Aug 2007 13:15:13 -0700

Rick Wager wrote:
>   Thanks Richard!
> 
> That's the way I read the errors also; they seem to indicate bad blocks 
> on the drives.  The bad news is that when they occur, access to the zfs 
> file system "stops" for quite a long time - seemingly from 30 seconds to 
> a minute or longer.


They might be bad blocks, though usually we get more info than "no additional
sense info."

30 seconds is a typical default retry timeout.  The file system will seem
to stop because the disk is ATA and can't handle multiple I/O operations
concurrently, so everything queued behind the retrying command has to wait.
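
As a rough illustration of why one retrying command appears to freeze
everything, here is a toy model in Python.  It assumes a device that
services strictly one request at a time; the function name and the service
times are made up for the example and are not measurements from this system.

```python
def serial_completion_times(service_times):
    """Return when each request finishes if the device handles one at a time.

    service_times: seconds each request occupies the device, in queue order.
    All requests are assumed to be queued at time 0 (a simplification).
    """
    clock = 0.0
    finished = []
    for seconds in service_times:
        clock += seconds          # the next request cannot start any earlier
        finished.append(clock)
    return finished


# Hypothetical workload: 10 ms requests, one of which sits out a 30 s timeout.
workload = [0.01, 0.01, 30.0, 0.01, 0.01]
for i, t in enumerate(serial_completion_times(workload)):
    print(f"request {i} done at {t:.2f} s")
```

Every request behind the slow one inherits its delay, which is what a
"stopped" file system looks like from user space.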

> Do you have a recommendation for how to identify and map the bad blocks 
> so they are not used again? Should I fill my disk with data in order to 
> identify the bad blocks?

The format command has a number of media scan and repair options.
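
Before scanning, it may help to tally which disk and which blocks the
warnings point at.  The following is only a sketch in Python: it assumes
the sd driver messages look like the sample quoted further down, one
message per line (unwrapped), and the function name, the regular
expressions and the device path in the sample are all hypothetical.

```python
import re
from collections import Counter

# Patterns modelled on the quoted sample; real logs may differ.
DEVICE_RE = re.compile(r"WARNING:\s+\S+\s+\((sd\d+)\):")
LEVEL_RE = re.compile(r"Error Level:\s+(\w+)")
BLOCK_RE = re.compile(r"Error Block:\s+(\d+)")


def tally_disk_errors(log_text):
    """Count (device, error level, error block) triples in sd warnings."""
    counts = Counter()
    device = level = None
    for line in log_text.splitlines():
        m = DEVICE_RE.search(line)
        if m:                      # a new warning starts: remember the disk
            device, level = m.group(1), None
            continue
        m = LEVEL_RE.search(line)
        if m:
            level = m.group(1)
            continue
        m = BLOCK_RE.search(line)
        if m and device is not None:
            counts[(device, level, int(m.group(1)))] += 1
    return counts


# Sample with a placeholder device path (the real path is not known here).
SAMPLE = """\
Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.warning] WARNING: /hypothetical/device/path (sd3):
Aug 14 11:20:28 chazz1  Error for Command: write    Error Level: Retryable
Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    Requested Block: 530    Error Block: 530
"""

for key, n in tally_disk_errors(SAMPLE).items():
    print(key, n)
```

If the same block numbers keep recurring on one disk, that disk is the
first candidate for a media scan.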

> Also, for what it's worth, I've been running a simple test on my system 
> that copies a large number of files around a zpool in order to fill it up 
> and verify the zpool is working reliably. It is just a simple shell 
> script, with very bad results:
> - In one window the shell script has frozen for about an hour now, the 
> cp command is just hung.
> - ls of the directories in the zfs file system just hangs
> - In another window zpool status is also hung and never returns:
> 
>     chazz1<##> zpool status                                     [11:54:49]
>       pool: fmpool
>      state: ONLINE
>      scrub: scrub completed with 0 errors on Tue Aug 14 12:01:40 2007
>     ^C^C^C^C^C^C^C^Z

Typical reaction for a malfunctioning disk.

> - And this is the output from zpool iostat
> 
>     chazz1<*> zpool iostat 10                                   [12:53:01]
>                    capacity     operations    bandwidth
>     pool         used  avail   read  write   read  write
>     ----------  -----  -----  -----  -----  -----  -----
>     fmpool       186G  2.08T    131    121  15.4M  13.3M
>     fmpool       186G  2.08T      0      0      0      0
>     fmpool       186G  2.08T      0      0      0      0
>     fmpool       186G  2.08T      0      0      0      0
>     fmpool       186G  2.08T      0      0      0      0
>     fmpool       186G  2.08T      0      0      0      0
> 
> I think we'll have to reboot to clear this frozen condition.
> 
> Any thoughts?

According to the ahci man page, the driver does not yet support
NCQ, which would also be consistent with the observed behaviour.
Do the disks work ok in other machines?
  -- richard

> Thanks,
> Rick
> 
> Richard Elling wrote:
>> Rick Wager wrote:
>>> We see similar problems on a SuperMicro with five 500 GB Seagate SATA 
>>> drives. This is using the AHCI driver. We do not, however, see 
>>> problems with the same hardware/drivers if we use 250 GB drives. 
>>
>> Duh.  The error is from the disk :-)
>>
>>> We sometimes see bad blocks reported (are these automatically 
>>> remapped somehow so they are not used again?) and sometimes sata port 
>>> resets.
>>
>> Depending on how the errors are reported, the driver may attempt a reset
>> to clear.  The drive may also automatically spare bad blocks.
>>
>>> Here is a sample of the log output. Any help understanding and/or 
>>> resolving this issue is greatly appreciated. I very much don't want to 
>>> have freezes in production.
>>> Aug 14 11:20:28 chazz1  port 2: device reset
>>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],2/[EMAIL PROTECTED],0 (sd3):
>>> Aug 14 11:20:28 chazz1  Error for Command: write    Error Level: Retryable
>>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    Requested Block: 530    Error Block: 530
>>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number:
>>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    Sense Key: No_Additional_Sense
>>> Aug 14 11:20:28 chazz1 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>>
>> This error was transient and retried.  If it was a fatal error (still
>> failed after retries) then you'll have another, different message
>> describing the failed condition.
>>  -- richard
>>
> 
> -- 
> 
> Rick Wager                                           
> email: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
> 303-818-0576 (mobile)
> 
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
