On Thu, 7 Apr 2011 16:21:00 -0600
Toby <[email protected]> wrote:

> Where to begin?  I really want to say some things, but life is a bit
> hectic at the moment.  I'll try to bring things up in a point-ish form,
> and maybe someone can comment and/or run with it.
> 
> 1) Most disks today support some form of read/write sync/barriers, in
> particular NCQ, TCQ, etc.  Unix has traditionally not had support for
> such concepts in the mid-layer (just like TRIM).  It really should, to
> give current filesystems a fighting chance at hitting a simple ordering
> sync point during writes without needing a full "wait until this
> synchronous write/cache flush has come back" type of operation.  The
> same concepts hold true for most RAID systems.
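> 
> To make the semantics concrete, here is a little userland toy model
> (all names here are made up, not any existing API): plain tags may be
> reordered freely, while an ordered tag acts as a fence nothing crosses,
> with no full queue drain required.
> 
>   #include <stdio.h>
>   #include <stdlib.h>
> 
>   enum { TAG_SIMPLE = 0, TAG_ORDERED = 1 };
> 
>   struct req { int lba; int tag; };
> 
>   /* Stand-in for the disk reordering plain-tagged writes by LBA. */
>   static int
>   cmp_lba(const void *a, const void *b)
>   {
>       return ((const struct req *)a)->lba - ((const struct req *)b)->lba;
>   }
> 
>   /* Sort (reorder) freely between fences; never across them. */
>   static void
>   dispatch(struct req *q, int n)
>   {
>       int start = 0;
> 
>       for (int i = 0; i <= n; i++) {
>           if (i < n && q[i].tag != TAG_ORDERED)
>               continue;
>           qsort(q + start, i - start, sizeof(q[0]), cmp_lba);
>           for (int j = start; j < i; j++)
>               printf("write lba %d\n", q[j].lba);
>           if (i < n)
>               printf("write lba %d (ordered tag, fence)\n", q[i].lba);
>           start = i + 1;
>       }
>   }
> 
>   int
>   main(void)
>   {
>       struct req q[] = {
>           { 900, TAG_SIMPLE }, { 100, TAG_SIMPLE }, /* data, any order */
>           {   7, TAG_ORDERED },                     /* e.g. metadata   */
>           { 500, TAG_SIMPLE }, {  50, TAG_SIMPLE }, /* later data      */
>       };
> 
>       dispatch(q, (int)(sizeof(q) / sizeof(q[0])));
>       return 0;
>   }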
> 
> 2) Once the appropriate sync-style tagged commands/etc. are sent, assume
> that the disk will get the data to media (possibly NVRAM, super-cap,
> whatever).  Assume that if it actually does not make it to media, it is
> data corruption; in other words, it is as if it had hit media but then
> lost its bits (marbles?).  The only way to really combat that is to have
> a decent checksumming algorithm to check the data that comes back from
> disk, the next time you read from it.
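> 
> The read-side check would look something like this (FNV-1a is just a
> placeholder here; a real implementation would want something stronger,
> e.g. CRC64 or a cryptographic hash):
> 
>   #include <stdint.h>
>   #include <stdio.h>
> 
>   static uint64_t
>   fnv1a64(const void *buf, size_t len)
>   {
>       const unsigned char *p = buf;
>       uint64_t h = 0xcbf29ce484222325ULL;
> 
>       while (len--) {
>           h ^= *p++;
>           h *= 0x100000001b3ULL;
>       }
>       return h;
>   }
> 
>   int
>   main(void)
>   {
>       unsigned char block[512] = "data we believe hit media";
>       uint64_t stored = fnv1a64(block, sizeof(block)); /* at write time */
> 
>       block[100] ^= 0x01;    /* the disk quietly loses a bit... */
> 
>       if (fnv1a64(block, sizeof(block)) != stored)
>           printf("checksum mismatch: silent corruption caught on read\n");
>       return 0;
>   }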
> 
> 3) As part of the disk scheduling infrastructure, it would be nice to
> keep statistics on the latencies of each disk's operations.  This would
> help in deciding how blocks get scheduled to disks and in keeping disk
> I/O channels from being saturated, possibly bounding latency for I/O
> operations per process/etc.  But it also helps in scheduling the best
> options/ordering for RAID-style I/O.  It may also give advance
> indication of pending doom... consumer SATA disks tend to have very
> high I/O retry counts... so if operations take time significantly
> outside of some historic deviation for the device, we may be able to
> give a warning to the user.
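> 
> The bookkeeping could be as cheap as Welford's online mean/variance
> per disk, with a warning when an op lands far outside the historic
> deviation (the thresholds below are pulled out of thin air):
> 
>   #include <math.h>
>   #include <stdio.h>
> 
>   struct lat_stats {
>       long   n;
>       double mean;    /* microseconds */
>       double m2;      /* running sum of squared deviations */
>   };
> 
>   static void
>   lat_record(struct lat_stats *s, double us)
>   {
>       double d = us - s->mean;
> 
>       s->n++;
>       s->mean += d / s->n;
>       s->m2 += d * (us - s->mean);
> 
>       if (s->n > 30) {    /* want some history before judging */
>           double sd = sqrt(s->m2 / (s->n - 1));
> 
>           if (us > s->mean + 6 * sd)
>               printf("disk warning: %.0fus op (mean %.0fus, sd %.0fus)"
>                   " -- excessive retries?  pending doom?\n",
>                   us, s->mean, sd);
>       }
>   }
> 
>   int
>   main(void)
>   {
>       struct lat_stats s = { 0, 0.0, 0.0 };
> 
>       for (int i = 0; i < 100; i++)
>           lat_record(&s, 200.0 + i % 7);  /* healthy ~200us ops */
>       lat_record(&s, 90000.0);            /* one 90ms outlier */
>       return 0;
>   }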
> 
> 4) One possible method of mitigating RAID1 (possibly other RAID levels
> too) is to use one sector (usually 512 bytes) per stripe on each disk.
> So, if you have a 2-disk RAID1 with 64K stripes, you end up with
> 2*512 / (64*1024) = 0.0156, or less than 2% overhead.  If you only do
> full-stripe writes (and reads), you end up with the possibility of
> doing a couple of things:
> 
> 4a) Keep a checksum of each stripe.  A nice 64-bit (or whatever)
> checksum of the full stripe as part of the "header" sector in each
> stripe (on each disk).  If you did that, you could even keep one
> checksum for a number of sectors; depending on stripe size, you may
> need to keep one checksum per 2 or 3 sectors.  Skew the checksums on
> the other disk(s) (RAID1-disk1 would be offset by 0, RAID1-disk2 by 2,
> etc.).  You'd likely want to keep some accounting information in the
> header sector as well, to help initialize the RAID faster.
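> 
> One possible header layout, just to show everything fits in a single
> sector (field names and sizes invented for illustration; with a 64K
> stripe, i.e. 128 data sectors, 56 checksums works out to one per 2-3
> sectors):
> 
>   #include <stdint.h>
>   #include <stdio.h>
> 
>   #define HDR_NCSUM 56            /* checksums that fit in one sector */
> 
>   struct stripe_hdr {
>       uint32_t magic;
>       uint32_t flags;             /* accounting: initialized, dirty... */
>       uint64_t stripe_no;         /* which stripe this header covers  */
>       uint64_t write_counter;     /* see 4b below                     */
>       uint16_t csum_skew;         /* per-disk offset into csum[]      */
>       uint16_t pad;
>       uint32_t hdr_csum;          /* checksum over the header itself  */
>       uint64_t csum[HDR_NCSUM];   /* checksums of groups of sectors   */
>   };
> 
>   /* Compile-time proof the header fits in one 512-byte sector. */
>   typedef char hdr_fits[sizeof(struct stripe_hdr) <= 512 ? 1 : -1];
> 
>   int
>   main(void)
>   {
>       printf("header: %zu of 512 bytes\n", sizeof(struct stripe_hdr));
>       return 0;
>   }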
> 
> 4b) Keep a rolling counter in each header.  A counter that simply gets
> incremented each time a stripe is written.  This counter is per full
> stripe (i.e., the same counter on each disk).  If this counter is
> 64 bits long, chances are you'll never have to take a full disk sync,
> waiting for all disks in the RAID1 to update, to handle the
> wrap-around.  On the other hand, it would make it fairly easy to
> re-construct a RAID set with best effort (both in terms of checksums
> and of temporal information).
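> 
> Reconstruction then boils down to picking, per stripe, the verified
> copy with the highest counter -- roughly like this (field names follow
> the header sketch above):
> 
>   #include <stdint.h>
>   #include <stdio.h>
> 
>   struct stripe_copy {
>       uint64_t write_counter; /* from that copy's header sector   */
>       int      csum_ok;       /* did its stripe checksums verify? */
>   };
> 
>   /* Return the index of the copy to trust, or -1 if none verifies. */
>   static int
>   pick_copy(const struct stripe_copy *c, int ncopies)
>   {
>       int best = -1;
> 
>       for (int i = 0; i < ncopies; i++) {
>           if (!c[i].csum_ok)
>               continue;
>           if (best < 0 || c[i].write_counter > c[best].write_counter)
>               best = i;
>       }
>       return best;
>   }
> 
>   int
>   main(void)
>   {
>       /* disk1 holds an older intact copy; disk2 died mid-write. */
>       struct stripe_copy copies[2] = {
>           { 41, 1 },
>           { 42, 0 },
>       };
> 
>       printf("rebuild stripe from disk%d\n", pick_copy(copies, 2) + 1);
>       return 0;
>   }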
> 
> 5) As mentioned before, use the NCQ/TCQ-style mechanisms to make sure
> the disk itself writes out the data portions of a stripe before the
> stripe header.  Allow the disk the ability to re-order data blocks, but
> have the write barrier in place to order these things.
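> 
> Tying 1) and 4) together, a stripe write could be queued like this
> (disk_queue() is a hypothetical mid-layer entry point, stubbed out
> here so the sketch actually runs):
> 
>   #include <stdint.h>
>   #include <stdio.h>
> 
>   enum { TAG_SIMPLE, TAG_ORDERED };
> 
>   /* Stub for a hypothetical per-disk queueing entry point. */
>   static void
>   disk_queue(int disk, uint64_t lba, const void *buf, int tag)
>   {
>       (void)buf;
>       printf("disk%d: queue lba %llu%s\n", disk,
>           (unsigned long long)lba,
>           tag == TAG_ORDERED ? " (ordered tag)" : "");
>   }
> 
>   /*
>    * Data sectors first, with plain tags (the disk may reorder them
>    * among themselves); the header last, behind an ordered tag, so it
>    * cannot reach media before the data it describes.
>    */
>   static void
>   stripe_write(int disk, uint64_t base_lba, const void *data,
>       int nsectors, const void *hdr)
>   {
>       const uint8_t *p = data;
> 
>       for (int i = 0; i < nsectors; i++)
>           disk_queue(disk, base_lba + 1 + i, p + i * 512, TAG_SIMPLE);
>       disk_queue(disk, base_lba, hdr, TAG_ORDERED);
>   }
> 
>   int
>   main(void)
>   {
>       static uint8_t data[4 * 512], hdr[512];
> 
>       stripe_write(0, 1000, data, 4, hdr);
>       return 0;
>   }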

It all depends on whether the underlying layer supports this kind of
operation.  Can someone shed light on whether it's possible in
DragonFly, for example with standard SATA or SCSI?

-- 
NetBSD - Simplicity is prerequisite for reliability
