Where to begin? I really want to say some things, but life is a bit hectic at the moment. I'll try to bring things up in a point-ish form, and maybe someone can comment and/or run with it.
1) Most disks today support some form of read/write sync/barriers, in particular NCQ, TCQ, etc. Unix has traditionally not had support for such concepts in the mid-layer (just as with TRIM). It really should, to give current filesystems a fighting chance of hitting a simple ordering sync point during a write without a full "wait until this synchronous write/cache flush has come back" type of operation. The same concepts hold true for most RAID systems.

2) Once the appropriate sync-style tagged commands/etc. are sent, assume that the disk will get the data to media (possibly NVRAM, super-cap, whatever). Assume that if it actually does not make it to media, it is data corruption; in other words, it is as if it had hit media but then lost its bits (marbles?). The only way to really combat that is a decent checksumming algorithm to check the data that comes back from the disk the next time you read it (a rough sketch follows after this list).

3) As part of the disk scheduling infrastructure, it would be nice to keep statistics on the latencies of each disk's operations. That would help in deciding how blocks get scheduled to disks and in keeping disk I/O channels from being saturated, possibly bounding latency for I/O operations per process, etc. It would also help in choosing the best options/ordering for RAID-style I/O. It may even give advance indication of pending doom... consumer SATA disks tend to have very high I/O retry counts... so if operations take significantly longer than the device's historic deviation, we may be able to warn the user (see the latency-tracking sketch below).

4) One possible way of mitigating consistency problems in a RAID1 (possibly other RAID levels too) is to reserve one header sector (usually 512 bytes) per stripe on each disk. So, if you have a 2-disk RAID1 with 64K stripes, you end up with 2*512/(64*1024) = 0.0156, or less than 2% overhead. If you only do full-stripe writes (and reads), you get the possibility of doing a couple of things (a possible header layout is sketched below):

4a) Keep a checksum of each stripe: a nice 64-bit (or whatever) checksum of the full stripe as part of the "header" sector in each stripe (on each disk). You could even keep one checksum per small group of sectors; depending on stripe size, that might work out to one checksum per 2 or 3 sectors. Skew the checksums on the other disk(s) (RAID1-disk1 would be offset by 0, RAID1-disk2 by 2, etc.). You'd likely want to keep some accounting information in the header sector as well, to help initialize the RAID faster.

4b) Keep a rolling counter in each header: a counter that simply gets incremented each time the stripe is written. This counter is per full stripe (i.e. the same counter value on each disk). If the counter is 64 bits long, chances are you'll never have to take a full disk sync across all disks in the RAID1 to handle wrap-around. On the other hand, it becomes fairly easy to reconstruct a RAID set on a best-effort basis (using both the checksums and the temporal information).

5) As mentioned before, use the NCQ/TCQ-style mechanisms to make sure the disk itself writes out the data portions of a stripe before the stripe header. Allow the disk to re-order the data blocks among themselves, but have a write barrier in place to order the data ahead of the header (a sketch of what that interface might look like follows at the end).
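To make the checksumming idea from point 2 a little more concrete, here is a minimal sketch in C. The block layout, the 64-bit FNV-1a hash, and all the names are made up for illustration; a real implementation would probably pick CRC32C or something stronger and keep the checksum in real on-disk metadata rather than next to the buffer:

/* Sketch: verify data coming back from disk against a checksum
 * computed when it was written.  Layout and names are illustrative
 * only, not a real on-disk format. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* 64-bit FNV-1a: small and simple, good enough to show the idea. */
static uint64_t fnv1a64(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;
    while (len--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

struct block {
    uint8_t  data[BLOCK_SIZE];
    uint64_t csum;              /* stored at write time */
};

static void block_write(struct block *b, const void *src, size_t len)
{
    memset(b->data, 0, sizeof(b->data));
    memcpy(b->data, src, len < BLOCK_SIZE ? len : BLOCK_SIZE);
    b->csum = fnv1a64(b->data, sizeof(b->data));
}

/* Returns 0 if the data still matches its checksum, -1 if the disk
 * claimed the write but later lost its bits (marbles). */
static int block_verify(const struct block *b)
{
    return fnv1a64(b->data, sizeof(b->data)) == b->csum ? 0 : -1;
}

int main(void)
{
    struct block b;
    block_write(&b, "hello", 5);

    b.data[100] ^= 0x01;        /* simulate silent corruption */

    printf("verify: %s\n", block_verify(&b) == 0 ? "ok" : "CORRUPT");
    return 0;
}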
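For point 3, something as simple as a running mean/variance per disk (Welford's method here) is enough to notice operations that land far outside the device's historic behaviour. The structure, the 3-sigma threshold, and the warm-up count are arbitrary choices for the sketch:

/* Sketch of per-disk latency bookkeeping: keep a running mean and
 * variance of completion times and warn when an operation is far
 * outside the historic deviation.  Build with: cc latstats.c -lm */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

struct disk_lat_stats {
    uint64_t n;          /* completed operations seen so far        */
    double   mean;       /* running mean latency (microseconds)     */
    double   m2;         /* sum of squared deviations from the mean */
};

static void lat_record(struct disk_lat_stats *s, double usec)
{
    double delta, delta2;

    s->n++;
    delta    = usec - s->mean;
    s->mean += delta / s->n;
    delta2   = usec - s->mean;
    s->m2   += delta * delta2;

    if (s->n > 100) {                       /* need some history first */
        double stddev = sqrt(s->m2 / (s->n - 1));
        if (usec > s->mean + 3.0 * stddev)
            fprintf(stderr,
                    "warning: I/O took %.0f us (mean %.0f, sd %.0f) -- "
                    "retries / pending failure?\n",
                    usec, s->mean, stddev);
    }
}

int main(void)
{
    struct disk_lat_stats s = { 0 };

    for (int i = 0; i < 1000; i++)
        lat_record(&s, 200.0 + (i % 7));    /* normal-ish completions  */
    lat_record(&s, 5000.0);                 /* a suspiciously slow one */
    return 0;
}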
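For point 4, a possible layout of the per-stripe header sector might look like the following. The field names and sizes are invented and don't match any existing RAID format; the point is just that one 512-byte sector comfortably holds the 4a checksum(s), the 4b rolling counter, and some accounting bits:

/* Hypothetical per-stripe header sector: one 512-byte sector at the
 * front of each 64K stripe, on each disk.  The remaining 127 sectors
 * of the stripe hold data. */
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE 512
#define STRIPE_SIZE (64 * 1024)

struct stripe_header {
    uint64_t magic;            /* identifies an initialized stripe       */
    uint64_t generation;       /* 4b: bumped on every stripe write,      */
                               /*     written identically to every disk  */
    uint64_t data_csum;        /* 4a: checksum of the 127 data sectors   */
    uint32_t disk_index;       /* which mirror this copy lives on        */
    uint32_t flags;            /* accounting bits (e.g. "never written") */
    /* ~480 bytes left over: room for roughly 60 more 64-bit checksums,
     * i.e. one checksum per two or three data sectors instead of one
     * for the whole stripe. */
    uint8_t  pad[SECTOR_SIZE - 32];
} __attribute__((packed));

_Static_assert(sizeof(struct stripe_header) == SECTOR_SIZE,
               "stripe header must be exactly one sector");

int main(void)
{
    printf("header size: %zu bytes\n", sizeof(struct stripe_header));
    return 0;
}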
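Finally, for points 1 and 5, this is roughly the kind of interface I have in mind: tag the header write so it is ordered after the data writes, instead of issuing a synchronous cache flush and waiting. Everything here (the struct, the flag, submit_io_batch()) is invented for illustration; a real mid-layer would map such a tag onto NCQ/TCQ ordered commands or FUA-style writes, and the stub below only prints what it would hand to the drive:

/* Hypothetical mid-layer: ordering expressed as a tag on the request
 * rather than a full "flush the write cache and wait" operation. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct io_request {
    uint64_t    lba;      /* target sector            */
    const void *buf;      /* data to write            */
    size_t      len;      /* length in bytes          */
    unsigned    flags;    /* ordering hints, see below */
};

/* Barrier tag: everything submitted earlier in the batch must be on
 * media (or NVRAM / super-cap protected cache) before this request. */
#define IO_ORDERED_AFTER_PRIOR 0x1

/* Stub mid-layer: just reports how each request may be scheduled. */
static int submit_io_batch(const struct io_request *reqs, int nr)
{
    for (int i = 0; i < nr; i++)
        printf("req %d: lba=%llu len=%zu %s\n", i,
               (unsigned long long)reqs[i].lba, reqs[i].len,
               (reqs[i].flags & IO_ORDERED_AFTER_PRIOR)
                   ? "(ordered after prior requests)"
                   : "(freely reorderable)");
    return 0;
}

/* Point 5: the data sectors of a stripe may be reordered among
 * themselves, but the header sector must not land before them. */
static int write_stripe(uint64_t stripe_lba, const void *data,
                        size_t data_len, const void *header)
{
    struct io_request reqs[2] = {
        { .lba = stripe_lba + 1, .buf = data,   .len = data_len, .flags = 0 },
        { .lba = stripe_lba,     .buf = header, .len = 512,
          .flags = IO_ORDERED_AFTER_PRIOR },
    };
    return submit_io_batch(reqs, 2);
}

int main(void)
{
    static char data[63 * 1024], header[512];
    return write_stripe(0, data, sizeof(data), header);
}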
-Toby.