Where to begin?  I really want to say some things, but life is a bit hectic at the moment.  I'll try to bring things up in a point-ish form, and maybe someone can comment and/or run with it.

1) Most disks today support some form of read/write sync/barrier, in particular via NCQ, TCQ, etc.  Unix has traditionally not had support for such concepts in the mid-layer (just like TRIM).  It really should, to give current filesystems a fighting chance of hitting a simple ordering sync point during a write without needing a full "wait until this synchronous write/cache flush has come back" type of operation.  The same concepts hold true for most RAID systems.
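To make that concrete, here is a rough sketch (in C, with entirely hypothetical names; no existing kernel API is implied) of the kind of ordering primitive the mid-layer could expose, so a filesystem can tag the one write that needs ordering instead of flushing the whole cache and waiting:

    /* Hypothetical mid-layer ordering primitive; every name here is made up. */
    #include <stdint.h>
    #include <stddef.h>

    enum io_order {
        IO_NONE,     /* no ordering constraint, disk may reorder freely */
        IO_BARRIER,  /* writes queued before this one must reach media first */
        IO_FLUSH     /* today's fallback: full synchronous cache flush */
    };

    struct io_request {
        uint64_t      lba;    /* starting sector */
        const void   *buf;    /* data to write */
        size_t        len;    /* length in bytes */
        enum io_order order;  /* ordering hint, mapped onto NCQ/TCQ tags */
    };

    /* With barrier support, only the commit record needs ordering; the bulk
     * data writes stay free for the disk to schedule as it likes. */
    static void submit_ordered_update(struct io_request *data,
                                      struct io_request *commit)
    {
        data->order   = IO_NONE;
        commit->order = IO_BARRIER;
        /* queue_io(data); queue_io(commit);   <- hypothetical queueing call */
    }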

2) Once the appropriate sync-style tagged commands/etc. are sent, assume that the disk will get the data to media (possibly NVRAM, super-cap, whatever).  Assume that if it actually does not make it to media, that is data corruption; in other words, treat it as if it had hit media but then lost its bits (marbles?).  The only way to really combat that is to have a decent checksumming algorithm to check the data that comes back from disk the next time you read it.
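A minimal sketch of that read-back verification, using FNV-1a purely as a placeholder checksum (a real implementation would likely want CRC64, a cryptographic hash, or similar):

    /* Checksum on write, verify on the next read, to catch "the disk said it
     * wrote it, but lost its bits".  FNV-1a is a placeholder only. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    static uint64_t fnv1a64(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 0x100000001b3ULL;               /* FNV prime */
        }
        return h;
    }

    /* On write: remember the checksum alongside the block's metadata. */
    static uint64_t checksum_for_write(const void *block, size_t len)
    {
        return fnv1a64(block, len);
    }

    /* On read: recompute and compare; a mismatch means the write never made
     * it to media intact and the block must be rebuilt from redundancy. */
    static bool block_is_valid(const void *block, size_t len, uint64_t stored)
    {
        return fnv1a64(block, len) == stored;
    }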

3) As part of the disk scheduling infrastructure, it would be nice to keep statistics on the latencies of each disk's operations.  This would help in making decisions about how blocks get scheduled to disks, and in keeping disk I/O channels from being saturated, possibly bounding latency for I/O operations per process/etc.  It also helps in scheduling the best options/ordering for RAID-style I/O.  It may also give advance indication of pending doom... consumer SATA disks tend to have very high I/O retry counts... so if operations take significantly longer than some historic deviation for the device, we may be able to give a warning to the user.

4) One possible method of mitigating the corruption problem above on a RAID1 (possibly other RAID levels too) is to reserve one sector (usually 512 bytes) per stripe on each disk.  So, if you have a 2-disk RAID1 with 64K stripes, you end up with (2*512)/(64*1024) = 0.0156, or less than 2% overhead.  If you only do full-stripe writes (and reads), you open up the possibility of doing a couple of things:

4a) Keep a checksum of each stripe.  A nice 64-bit (or whatever) checksum of the full stripe as part of the "header" sector in each stripe (on each disk).  If you did that, you could even keep one checksum per group of sectors; a 512-byte header holds at most 64 such 64-bit checksums (minus whatever accounting info you store there), so depending on stripe size you may need to keep one checksum per 2 or 3 sectors.  Skew the checksum layout on the other disk(s) (RAID1-disk1 would be offset by 0, RAID1-disk2 by 2, etc.).  You'd likely want to keep some accounting information in the header sector as well, to help initialize the RAID faster.
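To make the space arithmetic concrete, one possible (entirely assumed) layout for the 512-byte header sector might be:

    /* One way to lay out the 512-byte header sector; all field names and
     * sizes here are assumptions, not an existing format. */
    #include <stdint.h>

    #define HDR_MAGIC      0x52414944UL   /* arbitrary magic ("RAID") */
    #define HDR_CSUM_SLOTS 56             /* 64-bit checksums that fit below */

    struct stripe_header {
        uint32_t magic;                   /* identifies an initialized stripe */
        uint32_t flags;                   /* accounting bits (dirty, syncing, ...) */
        uint64_t seq;                     /* rolling write counter, see 4b */
        uint64_t reserved[6];             /* room for more accounting info */
        uint64_t csum[HDR_CSUM_SLOTS];    /* one checksum per group of sectors */
    };                                    /* 4+4+8+48+448 = 512 bytes */

That leaves 56 checksum slots for the roughly 128 data sectors of a 64K stripe, i.e. about one checksum per 2-3 sectors, in line with the estimate above.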

4b) Keep a rolling counter in each header.  A counter that simply gets incremented each time the stripe is written.  This counter is per full stripe (i.e., the same counter on each disk).  If this counter is 64 bits long, chances are you'll never have to take a full disk sync, waiting for all disks in the RAID1 to update, to handle the wrap-around.  On the other hand, it becomes fairly easy to reconstruct a RAID set with best effort (both in terms of checksums and of temporal information).
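A best-effort reconstruction pass could then pick which mirror copy to trust, per stripe, from the checksums and the counter; the struct and field names below are assumptions:

    /* Best-effort RAID1 stripe reconstruction sketch: prefer a copy whose
     * checksums (4a) verify, and among valid copies take the highest
     * write counter (4b). */
    #include <stdint.h>
    #include <stdbool.h>

    struct stripe_copy {
        uint64_t seq;        /* rolling counter from the header sector */
        bool     csums_ok;   /* did the per-stripe checksums verify? */
    };

    /* Returns the index of the copy to trust, or -1 if neither is usable. */
    static int pick_copy(const struct stripe_copy *a, const struct stripe_copy *b)
    {
        if (a->csums_ok && b->csums_ok)
            return (a->seq >= b->seq) ? 0 : 1;   /* newest valid copy wins */
        if (a->csums_ok)
            return 0;
        if (b->csums_ok)
            return 1;
        return -1;                               /* both damaged; needs repair */
    }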

5) As mentioned before, use the NCQ/TCQ-style mechanisms to make sure the disk itself writes out the data portions of a stripe before the stripe header.  Allow the disk the ability to re-order the data blocks, but have the write barrier in place to order these things.
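Putting 1), 4) and 5) together, a full-stripe RAID1 write might look like the sketch below; queue_write() and the IO_* tags are the same hypothetical primitive as in the point-1 sketch (re-declared here so the snippet stands alone), not a real API:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical, same shape as the point-1 sketch. */
    enum io_order { IO_NONE, IO_BARRIER };
    void queue_write(int disk, uint64_t lba, const void *buf, size_t len,
                     enum io_order order);

    #define STRIPE_BYTES (64 * 1024)
    #define SECTOR_BYTES 512

    static void write_stripe(int ndisks, uint64_t stripe_lba,
                             const void *data, const void *header)
    {
        for (int d = 0; d < ndisks; d++) {
            /* Data portion first: the disk may reorder these writes freely. */
            queue_write(d, stripe_lba + 1, data, STRIPE_BYTES, IO_NONE);
            /* Header sector (checksums + counter) last, barrier-tagged so it
             * cannot reach media before the data it describes. */
            queue_write(d, stripe_lba, header, SECTOR_BYTES, IO_BARRIER);
        }
    }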


-Toby.
