Re: end to end error recovery musings

2007-03-01 Thread James Bottomley
On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
> James Bottomley wrote:
>> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>>> 4104.  It's 8 bytes per hardware sector.  At least for T10...
>>
>> Er ... that won't look good to the 512 ATA compatibility remapping ...
>
> Well, in that case you'd only see 8x512 data bytes, no metadata...

i.e. no support for block guard in the 512 byte sector emulation
mode ...

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: end to end error recovery musings

2007-03-01 Thread H. Peter Anvin

James Bottomley wrote:
> On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
>> James Bottomley wrote:
>>> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>>>> 4104.  It's 8 bytes per hardware sector.  At least for T10...
>>>
>>> Er ... that won't look good to the 512 ATA compatibility remapping ...
>>
>> Well, in that case you'd only see 8x512 data bytes, no metadata...
>
> i.e. no support for block guard in the 512 byte sector emulation
> mode ...

That makes sense, though... if the raw sector size is 4096 bytes, that
metadata would presumably not exist on a per-sector basis.


-hpa


RE: end to end error recovery musings

2007-02-28 Thread Martin K. Petersen
>>>>> "Eric" == Moore, Eric [EMAIL PROTECTED] writes:

[Trimmed the worldwide broadcast CC: list down to linux-scsi]

Eric> From the SCSI LLD perspective, all we need is 32 byte CDBs, and a
Eric> mechanism to pass the tags down from above.

Ok, so your board only supports Type 2 protection?


Eric> It appears our driver to firmware interface is only providing
Eric> the reference and application tags.

My current code allows the submitter to specify which tags are valid
between the OS and the HBA.  Your inbound scsi_cmnd will have a 
protection_tag_mask which tells you which fields are provided.

Similarly, there's a mask in scsi_host which allows the HBA to
identify which protection types it supports.  I hadn't envisioned that
an HBA might only provide a subset.  I'll ponder a bit.


Eric> I assume that for transfers greater than a sector, the
Eric> controller firmware updates the tags for all the other sectors
Eric> within the boundary.

In other words you only support one app tag per request and not per
sector?

-- 
Martin K. Petersen  Oracle Linux Engineering



RE: end to end error recovery musings

2007-02-28 Thread Moore, Eric
On Tuesday, February 27, 2007 12:07 PM, Martin K. Petersen wrote:
> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk
> arrays.  For each 512 byte sector (or 4KB ditto) you get 8 bytes of
> protection data.  There's a 2 byte CRC (GUARD tag), a 2 byte
> user-defined tag (APP) and a 4-byte reference tag (REF).  Depending on
> how the drive is formatted, the REF tag usually needs to match the
> lower 32-bits of the target sector #.

From the SCSI LLD perspective, all we need is 32 byte CDBs and a
mechanism to pass the tags down from above.  It appears our driver to
firmware interface is only providing the reference and application
tags.  The guard tag is not present, so I guess the mpt fusion
controller firmware is setting it (I will have to check with others).
I assume that for transfers greater than a sector, the controller
firmware updates the tags for all the other sectors within the
boundary.  I'm sure the flags probably tell whether EEDP is enabled or
not.  I will have to check if there are some manufacturing pages that
say whether the controller is capable of EEDP (as not all our
controllers support it).


Here are the EEDP associated fields we provide in our scsi passthru, as
well as target assist.


u32 SecondaryReferenceTag
u16 SecondaryApplicationTag
u16 EEDPFlags
u16 ApplicationTagTranslationMask
u32 EEDPBlockSize


Re: end to end error recovery musings

2007-02-28 Thread James Bottomley
On Wed, 2007-02-28 at 12:16 -0500, Martin K. Petersen wrote:
> It's cool that it's on the radar in terms of the protocol.
>
> That doesn't mean that drive manufacturers are going to implement it,
> though.  The ones I've talked to were unwilling to sacrifice capacity
> because that's the main competitive factor in the SATA/consumer space.
>
> Maybe we'll see it in the nearline product ranges?  That would be a
> good start...

They wouldn't necessarily have to sacrifice capacity per se.  The
current problem is that, unlike SCSI disks, you can't seem to reformat
SATA ones to arbitrary sector sizes.  However, I could see the SATA
manufacturers selling capacity at 512 (or the new 4096) byte sectors but
allowing their OEMs to reformat them to 520 (or 4160) and then implement
block guard on top of this.  The OEMs who did this would obviously lose
about 1.6% of the capacity, but that would be their choice ...

James




Re: end to end error recovery musings

2007-02-28 Thread Martin K. Petersen
>>>>> "James" == James Bottomley [EMAIL PROTECTED] writes:

James> However, I could see the SATA manufacturers selling capacity at
James> 512 (or the new 4096) sectors but allowing their OEMs to
James> reformat them 520 (or 4160)

4104.  It's 8 bytes per hardware sector.  At least for T10...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: end to end error recovery musings

2007-02-28 Thread H. Peter Anvin

James Bottomley wrote:
> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>> 4104.  It's 8 bytes per hardware sector.  At least for T10...
>
> Er ... that won't look good to the 512 ATA compatibility remapping ...

Well, in that case you'd only see 8x512 data bytes, no metadata...

-hpa


Re: end to end error recovery musings

2007-02-27 Thread Martin K. Petersen
>>>>> "Eric" == Moore, Eric [EMAIL PROTECTED] writes:

Eric> Martin K. Petersen on Data Integrity Feature, which is also
Eric> called EEDP (End to End Data Protection), which he presented some
Eric> ideas/suggestions of adding an API in linux for this.

T10 DIF is interesting for a few things: 

 - Ensuring that the data integrity is preserved when writing a buffer
   to disk

 - Ensuring that the write ends up on the right hardware sector

These features make the most sense in terms of WRITE.  Disks already
have plenty of CRC on the data so if a READ fails on a regular drive
we already know about it.

We can, however, leverage DIF with my proposal to expose the
protection data to host memory.  This will allow us to verify the data
integrity information before passing it to the filesystem or
application.  We can say this is really the information the disk
sent. It hasn't been mangled along the way.

And by using the APP tag we can mark a sector as - say - metadata or
data to ease putting the recovery puzzle back together.

It would be great if the app tag was more than 16 bits.  Ted mentioned
that ideally he'd like to store the inode number in the app tag.  But
as it stands there isn't room.

In any case this is all slightly orthogonal to Ric's original post
about finding the right persistence heuristics in the error handling
path...

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: end to end error recovery musings

2007-02-27 Thread Alan
> These features make the most sense in terms of WRITE.  Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

Don't bet on it. If you want to do this seriously you need an end to end
(media to host ram) checksum. We do see bizarre and quite evil things
happen to people occasionally because they rely on bus level protection -
both faulty network cards and faulty disk or controller RAM can cause very
bad things to happen in a critical environment and are very very hard to
detect and test for.

IDE has another hideously evil feature in this area.  Command blocks are
sent by PIO cycles, and are therefore unprotected from corruption.  So
while a data burst with corruption will error and retry, a command whose
block number is corrupted - although very much less likely (fewer bits
and much lower speed) - will not be caught on a PATA system, for read or
for write, and will hit the wrong block.

With networking you can turn off hardware IP checksumming (and many
cluster people do); with disks we don't yet have a proper end-to-end
checksum-to-media system in the fs or block layers.

> It would be great if the app tag was more than 16 bits.  Ted mentioned
> that ideally he'd like to store the inode number in the app tag.  But
> as it stands there isn't room.

The lowest few bits are the most important with ext2/ext3 because you
normally lose a sector of inodes which means you've got dangly bits
associated with a sequence of inodes with the same upper bits. More
problematic is losing indirect blocks, and being able to keep some kind
of [inode low bits/block index] would help put stuff back together.

Alan


Re: end to end error recovery musings

2007-02-27 Thread Andreas Dilger
On Feb 27, 2007  19:02 +, Alan wrote:
>> It would be great if the app tag was more than 16 bits.  Ted mentioned
>> that ideally he'd like to store the inode number in the app tag.  But
>> as it stands there isn't room.
>
> The lowest few bits are the most important with ext2/ext3 because you
> normally lose a sector of inodes which means you've got dangly bits
> associated with a sequence of inodes with the same upper bits.  More
> problematic is losing indirect blocks, and being able to keep some kind
> of [inode low bits/block index] would help put stuff back together.

In the ext4 extents format there is the ability (not implemented yet)
to add some extra information into the extent index blocks (previously
referred to as the ext3_extent_tail).  This is planned to be a checksum
of the index block, and a back-pointer to the inode which is using this
extent block.

This allows online detection of corrupt index blocks, and also detection
of an index block that is written to the wrong location.  There is as
yet no plan that I'm aware of to have in-filesystem checksums of the
extent data.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: end to end error recovery musings

2007-02-27 Thread Ric Wheeler

Martin K. Petersen wrote:
>>>>> "Eric" == Moore, Eric [EMAIL PROTECTED] writes:
>
> Eric> Martin K. Petersen on Data Integrity Feature, which is also
> Eric> called EEDP (End to End Data Protection), which he presented some
> Eric> ideas/suggestions of adding an API in linux for this.
>
> T10 DIF is interesting for a few things:
>
>  - Ensuring that the data integrity is preserved when writing a buffer
>    to disk
>
>  - Ensuring that the write ends up on the right hardware sector
>
> These features make the most sense in terms of WRITE.  Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

There are paths through a read that could still benefit from the extra
data integrity.  The CRC gets validated on the physical sector, but we
don't have the same level of strict data checking once it is read into
the disk's write cache or being transferred out of cache on the way to
the transport...

> We can, however, leverage DIF with my proposal to expose the
> protection data to host memory.  This will allow us to verify the data
> integrity information before passing it to the filesystem or
> application.  We can say this is really the information the disk
> sent. It hasn't been mangled along the way.
>
> And by using the APP tag we can mark a sector as - say - metadata or
> data to ease putting the recovery puzzle back together.
>
> It would be great if the app tag was more than 16 bits.  Ted mentioned
> that ideally he'd like to store the inode number in the app tag.  But
> as it stands there isn't room.
>
> In any case this is all slightly orthogonal to Ric's original post
> about finding the right persistence heuristics in the error handling
> path...

Still all a very relevant discussion - I agree that we could really use
more than just 16 bits...


ric



Re: end to end error recovery musings

2007-02-27 Thread Martin K. Petersen
>>>>> "Alan" == Alan [EMAIL PROTECTED] writes:

> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in
> disk

[...]

Alan> but here's a minor bit of passing bad news - quite a few older
Alan> ATA controllers can't issue DMA transfers that are not a
Alan> multiple of 512 bytes without crapping themselves (eg
Alan> READ_LONG).  Guess we may need to add
Alan> ap->i_do_not_suck or similar 8)

I'm afraid it stops even before you get that far.  There doesn't seem
to be any interest in adopting the Data Integrity Feature (or anything
similar) in the ATA camp.  So for now it's a SCSI-only thing.

I encourage people to lean on their favorite disk manufacturer.  This
would be a great feature to have on SATA too...

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler


Alan wrote:
>> I think that this is mostly true, but we also need to balance this
>> against the need for higher levels to get a timely response.  In a
>> really large IO, a naive retry of a very large write could lead to a
>> non-responsive system for a very long time...
>
> And losing the I/O could result in a system that is non responsive until
> the tape restore completes two days later

Which brings us back to a recent discussion at the file system workshop
on being more repair oriented in file system design so we can survive
situations like this a bit more reliably ;-)


ric


Re: end to end error recovery musings

2007-02-26 Thread H. Peter Anvin

Theodore Tso wrote:
> In any case, the reason why I bring this up is that it would be really
> nice if there was a way with a single laptop drive to be able to do
> snapshots and background fsck's without having to use initrd's with
> device mapper.

This is a major part of why I've been trying to push integrated klibc to
have all that stuff as a unified kernel deliverable.  Unfortunately,
as you know, Linus apparently rejected the concept at least for now at
LKS last year.

With klibc this stuff could still be in one single wrapper without funny
dependencies, but wouldn't have to be ported to kernel space.


-hpa


Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler



Jeff Garzik wrote:
> Theodore Tso wrote:
>> Can someone with knowledge of current disk drive behavior confirm that
>> for all drives that support bad block sparing, if an attempt to write
>> to a particular spot on disk results in an error due to bad media at
>> that spot, the disk drive will automatically rewrite the sector to a
>> sector in its spare pool, and automatically redirect that sector to
>> the new location.  I believe this should be always true, so presumably
>> with all modern disk drives a write error should mean something very
>> serious has happened.
>
> This is what will /probably/ happen.  The drive should indeed find a
> spare sector and remap it, if the write attempt encounters a bad spot on
> the media.
>
> However, with a large enough write, large enough bad-spot-on-media, and
> a firmware programmed to never take more than X seconds to complete
> their enterprise customers' I/O, it might just fail.
>
> IMO, somewhere in the kernel, when we receive a read-op or write-op
> media error, we should immediately try to plaster that area with small
> writes.  Sure, if it's a read-op you lost data, but this method will
> maximize the chance that you can refresh/reuse the logical sectors in
> question.
>
> Jeff

One interesting counter example is a smaller write than a full page -
say 512 bytes out of 4k.

If we need to do a read-modify-write and it just so happens that 1 of
the 7 sectors we need to read is flaky, will this look like a write
failure?


ric


Re: end to end error recovery musings

2007-02-26 Thread Alan
> One interesting counter example is a smaller write than a full page -
> say 512 bytes out of 4k.
>
> If we need to do a read-modify-write and it just so happens that 1 of
> the 7 sectors we need to read is flaky, will this look like a write
> failure?

The current core kernel code can't handle propagating sub-page sized
errors up to the file system layers (there is nowhere in the page cache
to store 'part of this page is missing').  This is a long standing (four
year plus) problem with CD-RW support as well.

For ATA we can at least retrieve the true media sector size now, which
may be helpful at the physical layer but the page cache would need to
grow some brains to do anything with it.


Re: end to end error recovery musings

2007-02-25 Thread Douglas Gilbert
H. Peter Anvin wrote:
> Ric Wheeler wrote:
>> We still have the following challenges:
>>
>> (1) read-ahead often means that we will retry every bad sector at
>> least twice from the file system level.  The first time, the fs read
>> ahead request triggers a speculative read that includes the bad sector
>> (triggering the error handling mechanisms) right before the real
>> application triggers a read that does the same thing.  Not sure what
>> the answer is here since read-ahead is obviously a huge win in the
>> normal case.
>
> Probably the only sane thing to do is to remember the bad sectors and
> avoid attempting reading them; that would mean marking automatic
> versus explicitly requested requests to determine whether or not to
> filter them against a list of discovered bad blocks.

Some disks are doing their own read-ahead in the form
of a background media scan. Scans are done on request or
periodically (e.g. once per day or once per week) and we
have tools that can fetch the scan results from a disk
(e.g. a list of unreadable sectors). What we don't have
is any way to feed such information to a file system
that may be impacted.

Doug Gilbert




Re: end to end error recovery musings

2007-02-24 Thread Chris Wedgwood
On Fri, Feb 23, 2007 at 09:32:29PM -0500, Theodore Tso wrote:

> And having a way of making this list available to both the
> filesystem and to a userspace utility, so they can more easily deal
> with doing a forced rewrite of the bad sector, after determining
> which file is involved and perhaps doing something intelligent (up
> to and including automatically requesting a backup system to fetch a
> backup version of the file, and if it can be determined that the
> file shouldn't have been changed since the last backup,
> automatically fixing up the corrupted data block :-).

I had a small C program + perl script that would take a badblocks list
and figure out which files on an XFS filesystem were trashed.  In the
case of XFS it's somewhat easier because you can dump the extents for a
file, but something more generic wouldn't be hard to make work.  It
also wouldn't be hard to extend this to inodes in some cases, though
I'm not sure there is much you can do there beyond fsck.



Re: end to end error recovery musings

2007-02-23 Thread H. Peter Anvin

Ric Wheeler wrote:
> We still have the following challenges:
>
>    (1) read-ahead often means that we will retry every bad sector at
> least twice from the file system level.  The first time, the fs read
> ahead request triggers a speculative read that includes the bad sector
> (triggering the error handling mechanisms) right before the real
> application triggers a read that does the same thing.  Not sure what
> the answer is here since read-ahead is obviously a huge win in the
> normal case.

Probably the only sane thing to do is to remember the bad sectors and
avoid attempting reading them; that would mean marking automatic
versus explicitly requested requests to determine whether or not to
filter them against a list of discovered bad blocks.


-hpa


Re: end to end error recovery musings

2007-02-23 Thread Andreas Dilger
On Feb 23, 2007  16:03 -0800, H. Peter Anvin wrote:
> Ric Wheeler wrote:
>> (1) read-ahead often means that we will retry every bad sector at
>> least twice from the file system level.  The first time, the fs read
>> ahead request triggers a speculative read that includes the bad sector
>> (triggering the error handling mechanisms) right before the real
>> application triggers a read that does the same thing.  Not sure what
>> the answer is here since read-ahead is obviously a huge win in the
>> normal case.
>
> Probably the only sane thing to do is to remember the bad sectors and
> avoid attempting reading them; that would mean marking automatic
> versus explicitly requested requests to determine whether or not to
> filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will almost
certainly be relocated at the disk level.  For that matter, a huge win
would be to have the MD RAID layer rewrite only the bad sector (in hopes
of the disk relocating it) instead of failing the whole disk.  Otherwise,
a few read errors on different disks in a RAID set can take the whole
system offline.  Apologies if this is already done in recent kernels...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: end to end error recovery musings

2007-02-23 Thread H. Peter Anvin

Andreas Dilger wrote:
> And clearing this list when the sector is overwritten, as it will almost
> certainly be relocated at the disk level.

Certainly if the overwrite is successful.

-hpa


Re: end to end error recovery musings

2007-02-23 Thread Theodore Tso
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
>> Probably the only sane thing to do is to remember the bad sectors and
>> avoid attempting reading them; that would mean marking automatic
>> versus explicitly requested requests to determine whether or not to
>> filter them against a list of discovered bad blocks.
>
> And clearing this list when the sector is overwritten, as it will almost
> certainly be relocated at the disk level.  For that matter, a huge win
> would be to have the MD RAID layer rewrite only the bad sector (in hopes
> of the disk relocating it) instead of failing the whole disk.  Otherwise,
> a few read errors on different disks in a RAID set can take the whole
> system offline.  Apologies if this is already done in recent kernels...

And having a way of making this list available to both the filesystem
and to a userspace utility, so they can more easily deal with doing a
forced rewrite of the bad sector, after determining which file is
involved and perhaps doing something intelligent (up to and including
automatically requesting a backup system to fetch a backup version of
the file, and if it can be determined that the file shouldn't have
been changed since the last backup, automatically fixing up the
corrupted data block :-).

- Ted