SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-23 Thread Michael Bilow
On 2010-02-23 at 17:43 -0500, Benjamin Scott wrote:

> On Tue, Feb 23, 2010 at 2:01 PM, Michael Bilow wrote:
>> During the md check operation, the array is "clean" (not degraded)
>> and you can see that explicitly with the "[UU]" status report ...
>
>  Of course, mdstat still calls the array "clean" even after
> mismatches are detected, which isn't what I'd usually call "clean"...
> :-)

The term "clean" in this context just means that all of the RAID 
components (physical drives) are still present.
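
For anyone who wants to see this on a live box, both the member status
and the mismatch count are exposed directly -- a quick sketch, assuming
the array is md0 (substitute your own device; the sysfs writes need root):

  cat /proc/mdstat                             # member status, e.g. "[UU]"
  echo check > /sys/block/md0/md/sync_action   # start a check pass
  cat /sys/block/md0/md/mismatch_cnt           # inspect the count when it finishes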

>> It is not a "scrub" because it does not attempt to repair anything.
>
>  Comments in previously mentioned config file don't make it sound
> like that.  "A check operation will scan the drives looking for bad
> sectors and automatically repairing only bad sectors."  It doesn't
> explain how it would repair bad sectors.  Perhaps it means the bad
> sectors will be "repaired" by failing the entire member and having the
> sysadmin insert a new disk.  Perhaps the comments are just wrong.
>
>  Not arguing with you, just reporting what the file told me.  Would
> the file lie?  ;-)

That's sort of true and sort of not true, but generally outdated. It 
is important to appreciate that the "md" device operates at a level 
of abstraction above block devices that isolates it from low-level 
details that are handled by whatever driver manages the block 
devices. For something like a parallel IDE drive -- or, heaven 
forbid, an ST-506 drive -- there is not a lot of intelligence on 
board the drive that will mask error conditions: a read error is a 
read error.

When SCSI (meaning SCSI-2) was developed, it provided for a ton of 
settable parameters, some vendor-independent and some proprietary. 
Among these were mode page bits that controlled what the device 
would do by default on encountering errors during read or write, 
notably the "ARRE" (automatic read reallocation) and "AWRE" 
(automatic write reallocation) bits. Exactly what a device does when 
these bits are asserted is not too well specified, especially 
considering that a disk and a tape may have radically different 
ranges of options but use the same basic SCSI command set. In 
practice, I can't think of any reasonable way to implement ARRE: 
it's almost always worse to return bad data from a read operation 
with a success code than to just have the read operation report a 
failure code outright.

(ATAPI is essentially a protocol for wrapping SCSI commands and 
responses into packets for ATA devices, so the same logic carries 
over to the ATA world.)
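
If anyone is curious what their own drives are set to, those bits live
in the read-write error recovery mode page, and the sdparm utility can
report and (in principle) change them. A sketch, assuming a SCSI disk
at /dev/sda and that sdparm is installed; whether the drive actually
honors the change is up to the firmware:

  sdparm --get=AWRE,ARRE /dev/sda   # report the current settings
  sdparm --set=AWRE=1 /dev/sda      # ask the drive to enable auto write reallocation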

>> Detecting and reporting "soft failure" incidents
>> such as reallocations of spare sectors ...
>
>  The relocation algorithm in modern disks generally works like this
> (or so I'm told):
>
> R1. OS requests read logical block from HDD.  HDD tries to read from
> block on disk, and can't, even with retries and ECC.  HDD returns
> failure to the OS, and marks that physical block as "bad" and as a
> candidate for relocation.

At this point, an unreadable block encountered on a block device is 
handled at a very high level, usually the file system, well above 
where things like AWRE on the hardware can occur. This is where the 
"md" driver will intervene, attempting to reconstruct the unreadable 
block from its reservoir of redundancy (the other copy if RAID-1, 
the other stripes if RAID-5). If the "md" driver can reconstruct the 
unreadable data, it will attempt to write the correct data back to 
the block device: at this point, the hardware may reallocate a spare 
sector for the new data. Unless a write occurs somehow, though, even 
with AWRE enabled the hardware should not reallocate a sector.

When a write succeeds and forces an AWRE event, the hardware 
test-reads the newly written data and returns an error if the data 
could not be verified. By this stage, the "md" device may have had 
cause to mark the whole block device as bad and degrade the array.
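
Both sides of that are visible from userspace, for what it's worth: md
keeps a per-member count of read errors it has corrected, and the
drive's own reallocation totals show up in SMART. A sketch, assuming an
array md0 with a member sda1:

  cat /sys/block/md0/md/dev-sda1/errors                      # read errors md corrected on that member
  smartctl -A /dev/sda | grep -i -e Reallocated -e Pending   # the drive's own view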

> R2. Repeated attempts by OS to read from the same block cause the HDD
> to retry.  It won't throw away your data on its own.

Correct, in all practical cases the hardware will never reallocate a 
bad block on read operations. The SCSI protocol provides for ARRE, 
but as I noted this is never really implemented.

> R3. OS requests write to same logical block.  HDD relocate to
> different physical block, and throws away the bad block.  It can do
> that now, since you've told it you don't want the data that was there,
> by writing new data over it.

Again, exactly what happens is going to vary a lot with the 
particular hardware. Older drives, even parallel ATA drives, 
generally cannot reallocate a spare sector on the fly during normal 
operation, but can only do it during a low-level format operation of 
the whole drive. This is because the reserve of spare sectors on 
such drives is associated with physical zones, so that reallocation 
can only occur during a track-granular write operation.

In my experience, nearly all SCSI drives have AWRE disabled from the 
factory, and it is up to the integrator or sysadmin to enable it 
deliberately if that behavior is wanted.

Re: SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-24 Thread Benjamin Scott
On Tue, Feb 23, 2010 at 9:32 PM, Michael Bilow wrote:
>>  Of course, mdstat still calls the array "clean" even after
>> mismatches are detected, which isn't what I'd usually call "clean"...
>
> The term "clean" in this context just means that all of the RAID
> components (physical drives) are still present.

  Like I said, not what I'd usually call "clean".  :)

> ... "md" device operates at a level of abstraction above
> block devices that isolates it ...

  Sure.  That doesn't mean the md driver can't follow an algorithm
that hopes the drive will do something intelligent, or at least hope
that re-writing a block might improve the odds somehow.  What are the
alternatives?  We could fail the whole member out of the array.  That
could be an overreaction, and definitely reduces redundancy if it's
just one bad block out of several billion.  Or we could do nothing.  I
can't think of a situation where rewriting one block could cause
serious problems that weren't already about to break loose.  No?
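
  For what it's worth, rewrite-in-place is roughly what md's "repair"
action does when a scan finds blocks that disagree, as opposed to the
report-only "check" -- again assuming the array is md0:

  echo repair > /sys/block/md0/md/sync_action   # rewrite blocks that disagree
                                                # (RAID-1: one copy is propagated to the rest)
  cat /sys/block/md0/md/mismatch_cnt            # mismatches found by the last scan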

> Unless a write occurs somehow, though, even
> with AWRE enabled the hardware should not reallocate a sector.

  Right, the drive should remain willing to keep retrying the read as
long as you do.

>> R3. OS requests write to same logical block.
>
> Again, exactly what happens is going to vary a lot with the
> particular hardware. Older drives, even parallel ATA drives,
> generally cannot reallocate a spare sector on the fly ...

  Sure, on-the-fly relocation is a *relatively* new thing.  But it's
been around in the IDE world, at least in theory, for what, ten years?
 Implementation may be inconsistent; that I would buy.  But I know
I've seen both parallel and serial ATA drives where the "relocated
blocks" statistic was non-zero and climbed over time.  I've seen the
"pending relocations" be high until a "badblocks -w" pass, and then it
dropped to zero and "relocated blocks" jumped up.  The smartmontools
FAQ says modern drives can relocate bad sectors on write; their "Bad
block HOWTO" goes into some detail on SCSI drives.  Either there's an
awful lot of misleading happening, or this stuff actually does work
sometimes.  :-)
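
  The before/after is easy enough to reproduce on a scratch drive, if
anyone cares to.  Note that badblocks -w is destructive; /dev/sdb below
is purely a made-up example and had better not hold anything you want:

  smartctl -A /dev/sdb | egrep 'Reallocated_Sector|Current_Pending'  # note the starting values
  badblocks -wsv /dev/sdb                                            # destructive write test of the whole device
  smartctl -A /dev/sdb | egrep 'Reallocated_Sector|Current_Pending'  # pending should drop, reallocated may climb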

  I'm not so worried if that 120 MB IDE disk I still have in my
closet[1] doesn't do on-the-fly relocation.  ;-)

[1] = Hey, it might come in handy some day!

  Perhaps what we should all be worrying about, rather than ancient
drives, is the flood of USB flash stuff that's happening.  Anyone know
how *that* typically does when it comes to self-monitoring and
-healing?  It'd be a shame if the migration to flash storage sets us
back years in that area.
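
  For what it's worth, smartmontools will at least try to talk through
a USB bridge; whether the device behind it answers is another matter.
A guess at the incantation (device name made up):

  smartctl -d sat -a /dev/sdc   # ask a USB-attached device for SMART data via SAT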

>> It makes me wonder just what the
>> overall SMART health is supposed to indicate -- "Yes, the HDD is
>> physically present"?  :)
>
> SMART is just a communications protocol.

  So, basically, the SMART "overall health" (or whatever it's called)
is just reporting whatever the manufacturer programmed the drive to
report, and may be completely useless.  Good to know.  :)
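
  For the record, that verdict is what "smartctl -H" prints, and the
raw attributes behind it are a separate query -- assuming the disk is
/dev/sda:

  smartctl -H /dev/sda   # the drive's overall-health self-assessment
  smartctl -A /dev/sda   # the individual attributes it presumably draws on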

>>  I did once have the BIOS check start reporting a SMART health
>> warning, but all the OEM diagnostics, smartctl, "badblocks -w", etc.,
>> didn't actually report anything wrong.
>
> SMART is not designed to predict infant mortality and unusual
> failures ...

  Whatever.  :)  My point was that the drive seemed to be indicating
something was wrong, but nobody[2] could figure out why it was doing
that.  SMART overall health was reporting failure but everything else
seemed to be good.  Like I said, it could be the drive knew something
that couldn't be reported using other tools, and it actually averted a
real failure.

[2] = Well, for sufficiently small definitions of "nobody".  Me, one
tech support guy, and a handful of software tools.  :)

-- Ben



Re: SMART diags (was: Re: mismatch_cnt != 0, member content mismatch, but md says the mirror is good)

2010-02-24 Thread Benjamin Scott
P.S.:

On Tue, Feb 23, 2010 at 9:32 PM, Michael Bilow wrote:
> At this point, an unreadable block encountered on a block device is
> handled at a very high level, usually the file system, well above
> where things like AWRE on the hardware can occur.

  Heck, it's not even handled by the filesystem.  It usually goes
something like this: HDD returns error to the controller, controller
driver returns error to the block device layer, block layer returns
error to filesystem, filesystem returns error to C library, C library
returns error to application, application pukes on its shoes, sysadmin
gets a call at 3 AM saying the server is down.  ;-)
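
  Most of that chain is easy to watch from a shell, for that matter: a
read of a bad spot comes back as EIO the whole way up.  Device and
offset below are made up:

  dd if=/dev/sdb of=/dev/null bs=4096 skip=123456 count=1   # prints "Input/output error" on a bad sector
  dmesg | tail                                              # the kernel/driver side of the same failure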

-- Ben