On Thu, 7 Apr 2011 16:21:00 -0600 Toby <[email protected]> wrote:
> Where to begin? I really want to say some things, but life is a bit hectic
> at the moment. I'll try to bring things up in a point-ish form, and maybe
> someone can comment and/or run with it.
>
> 1) Most disks today have the ability to have some form of read/write
> sync/barriers, in particular NCQ, TCQ, etc. Unix has traditionally not had
> support for such concepts in the mid-layer (just like TRIM). It really
> should, to help give current filesystems a fighting chance at not needing a
> full "wait until this synchronous write/cache flush has come back" type of
> operation just to hit a simple ordering sync point during a write. The same
> concepts hold true for most RAID systems.
>
> 2) Once the appropriate sync-style tagged commands/etc. are sent, assume
> that the disk will get it to media (possibly NVRAM, super-cap, whatever).
> Assume that if it actually does not make it to media it is data corruption;
> in other words, it is as if it had hit media but then lost its bits
> (marbles?). The only way to really combat that is to have a decent
> checksumming algorithm to check the data that comes back from disk the next
> time you read it.
>
> 3) As part of the disk scheduling infrastructure, it would be nice to keep
> statistics on the latencies of each disk's operations. This will help in
> making decisions about how blocks get scheduled to disks and in keeping
> disk I/O channels from being saturated, possibly bounding latency for I/O
> operations per process/etc. It also provides a way to help schedule the
> best options/ordering for RAID-style I/O. It may also give advance
> indication of pending doom... consumer SATA disks tend to have very high
> I/O retry counts... so if operations take significantly longer than the
> device's own historic deviation, we may be able to give a warning to the
> user.
>
> 4) One possible method of mitigating a RAID1 (possibly other RAID levels
> too) is to use one sector (usually 512 bytes) per stripe on each disk. So,
> if you have a 2-disk RAID1 with 64K stripes, you end up with
> 2*512/(64*1024) = 0.0156, or less than 2% overhead. If you only do
> full-stripe writes (and reads), you end up with the possibility of doing a
> couple of things:
>
> 4a) Keep a checksum of each stripe. A nice 64-bit (or whatever) checksum of
> the full stripe as part of the "header" sector in each stripe (on each
> disk). If you did that, you could even keep one checksum for a number of
> sectors; depending on stripe size, you may need to keep one checksum per 2
> or 3 sectors. Skew the checksum on the other disk(s) (RAID1-disk1 would be
> offset by 0, RAID1-disk2 by 2, etc.). You'd likely want to keep some
> accounting information in the header sector as well, to help initialize the
> RAID faster.
>
> 4b) Keep a rolling counter in each header. A counter that simply gets
> incremented each time a stripe is written. This counter is per full stripe
> (i.e. the same counter on each disk). If this counter is 64 bits long,
> chances are you'll never have to take a full disk sync to wait for all
> disks in the RAID1 to update to handle the wrap-around. On the other hand,
> it would be fairly easy to reconstruct a RAID set with best effort (both in
> terms of checksums and of temporal information).
>
> 5) As mentioned before, use the NCQ/TCQ-style things to make sure the disk
> itself will write out the data portions of a stripe before the stripe
> header.
> Allow the disk the ability to re-order data blocks, but have the write
> barrier in place to order these things.

It all depends on whether the underlying layer supports this kind of
operation. Can someone shed light on whether this is possible in DragonFly,
for example with standard SATA or SCSI? To make the points above a bit more
concrete, a few rough sketches follow; they are purely illustrative and the
names in them are made up.
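For 2) and 4a): a sketch of what the per-stripe header sector and the
read-time verification could look like. The layout, the magic value, and the
use of FNV-1a as the checksum are stand-ins, not a proposal for a real
on-disk format.

#include <stdint.h>
#include <stddef.h>

#define STRIPE_HDR_MAGIC    0x53545248U     /* invented for the sketch */

struct stripe_hdr {
    uint64_t    sh_magic;       /* identifies an initialized stripe */
    uint64_t    sh_counter;     /* 4b: bumped on every full-stripe write */
    uint64_t    sh_cksum;       /* 4a: checksum of the stripe's data */
    uint64_t    sh_flags;       /* accounting bits for faster RAID init */
    uint8_t     sh_pad[512 - 4 * sizeof(uint64_t)]; /* fill the sector */
};

/* 64-bit FNV-1a, standing in for whatever checksum would really be used. */
static uint64_t
stripe_cksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = 14695981039346656037ULL;
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (h);
}

/*
 * 2): the next time a stripe is read, recompute the checksum and compare
 * against the header; a mismatch means the write never really hit media
 * (or the media lost its bits) and the other mirror should be tried.
 * Returns 1 if the stripe verifies, 0 otherwise.
 */
static int
stripe_verify(const struct stripe_hdr *hdr, const void *data, size_t len)
{
    if (hdr->sh_magic != STRIPE_HDR_MAGIC)
        return (0);
    return (stripe_cksum(data, len) == hdr->sh_cksum);
}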
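For 3): one cheap way to keep per-disk latency statistics is a running mean
and variance (Welford's method), plus a warning when an operation lands far
outside the device's own history. The sample threshold and the multiplier k
are arbitrary choices here.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

struct disk_latstats {
    uint64_t    n;      /* completed operations seen */
    double      mean;   /* running mean latency, microseconds */
    double      m2;     /* running sum of squared deviations */
};

/* Feed one completed I/O's latency (in microseconds) into the stats. */
static void
latstats_update(struct disk_latstats *st, double lat_us)
{
    double delta, delta2;

    st->n++;
    delta = lat_us - st->mean;
    st->mean += delta / st->n;
    delta2 = lat_us - st->mean;
    st->m2 += delta * delta2;
}

/*
 * Warn if this latency is more than k standard deviations above the
 * historic mean; only meaningful once enough samples have been seen.
 * High outliers on consumer SATA often mean the drive is retrying.
 */
static void
latstats_check(const struct disk_latstats *st, const char *disk,
    double lat_us, double k)
{
    double stddev;

    if (st->n < 100)
        return;
    stddev = sqrt(st->m2 / (st->n - 1));
    if (lat_us > st->mean + k * stddev)
        printf("%s: I/O took %.0f us (mean %.0f us), possible retries?\n",
            disk, lat_us, st->mean);
}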
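For 4b): during a best-effort rebuild, the rolling counter plus checksum
validity is enough to decide which mirror to copy each stripe from. The
per-stripe summary struct is invented for the sketch.

#include <stdint.h>

struct stripe_state {
    uint64_t    counter;    /* rolling counter read from the header */
    int         cksum_ok;   /* did the stripe data match its checksum? */
};

/*
 * Pick the mirror to copy this stripe from: prefer copies whose checksum
 * verified, and among those take the highest counter (the most recent
 * successful write). Returns the disk index, or -1 if no copy is usable.
 */
static int
pick_resync_source(const struct stripe_state *disks, int ndisks)
{
    int i, best = -1;

    for (i = 0; i < ndisks; i++) {
        if (!disks[i].cksum_ok)
            continue;
        if (best == -1 || disks[i].counter > disks[best].counter)
            best = i;
    }
    return (best);
}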
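For 1) and 5): roughly what the ordering could look like if the mid-layer
exposed a barrier flag on writes. raid_write() and RAIDW_BARRIER are
hypothetical; they only stand in for whatever tagged-command or barrier
mechanism the driver layer would actually provide.

#include <stdint.h>
#include <stddef.h>

#define RAIDW_NONE      0x0     /* plain async write, disk may reorder */
#define RAIDW_BARRIER   0x1     /* all earlier writes reach media first */

/* Hypothetical: queue one write to a member disk with the flags above. */
int raid_write(int disk, uint64_t lba, const void *buf, size_t len,
    int flags);

/*
 * Write one full stripe to one member: the data sectors may be reordered
 * freely by the disk, but the header (checksum + counter) must not land
 * before them, so it carries the barrier flag instead of forcing a full
 * synchronous cache flush.
 */
static int
write_stripe(int disk, uint64_t hdr_lba, const void *hdr,
    uint64_t data_lba, const void *data, size_t data_len)
{
    int error;

    error = raid_write(disk, data_lba, data, data_len, RAIDW_NONE);
    if (error)
        return (error);
    return (raid_write(disk, hdr_lba, hdr, 512, RAIDW_BARRIER));
}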
--
NetBSD - Simplicity is prerequisite for reliability