On Thu, 7 Apr 2011 11:57:37 -0400 Venkatesh Srinivas <[email protected]> wrote:
> On Thu, Apr 7, 2011 at 10:27 AM, Adam Hoka <[email protected]> wrote:
> > Please see my proposal:
> >
> > http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/ahoka/1#
>
> Hi!
>
> I'll take a look at your proposal in just a bit. Here are some things
> you might want to think about when looking at RAID1 though...
>
> Here are some details about how I planned to do dmirror and why I
> think RAID1 is a much more difficult problem than it seems at first
> glance.
>
> Imagine a RAID1 of two disks, A and B; you have an outstanding set of
> I/O operations, buf1, buf2, buf3, buf4*, buf5, buf6, buf7, buf8*. The
> BUFs are a mix of READs and WRITEs. At some point, your friendly
> neighborhood DragonFly developer walks over and pulls the plug on your
> system (you said you were running NetBSD! It's a totally valid excuse!
> :))

Yes, this is indeed an important design aspect.

> Each of the write bufs could be totally written, partially written, or
> not written at all to each of the disks. More importantly, each disk
> could have seen and completed (or not completed) the requests in a
> different order. And this reorder can happen after the buf has been
> declared done and biodone() has been called (and we've reported
> success to userland). This could be because of reordering or
> coalescing at the drive controller or the drive, for example.

One thing I learned from writing the flash block driver: never biodone
I/O that is not really finished. You can do it, but it's a PITA to
handle. Also, it's simplest to do writes synchronously; we can report
success when the fastest write is done, or wait for the slower one.

> So in this case, let's say disk A had seen and totally written buf2 and
> partially written buf1. Disk B had seen and totally written buf1 and
> not seen buf2. And we'd reported success to the filesystem above
> already.
>
> So when we've chastised the neighborhood DragonFly developer and
> powered on the system, we have a problem. We have two halves of a RAID
> mirror that are not in sync. The simplest way to sync them would be to
> declare one of the two disks correct and copy one over the other
> (possibly optimizing the copy with a block bitmap, as you suggested
> and as Linux's MD raid1 (among many others) implements; block bitmaps
> are more difficult than they seem at first [1]).
>
> So let's declare disk A as correct and copy it over disk B. Now, disk
> B's old copy of buf2->block is overwritten with the correct copy from
> disk A, and disk B's correct, up-to-date copy of buf1->block is
> overwritten with a scrambled version of buf1->block. This is not okay,
> because we'd already reported success at writing both buf1 and buf2 to
> the filesystem above.
>
> Oops.
>
> This failure mode has always been possible in single-disk
> configurations where write reordering is possible; file systems have
> long had a solitary tool to fight the chaos, BUF_CMD_FLUSH. A FLUSH
> BUF acts as a barrier: it does not return until all prior requests
> have completed and hit media, and it does not allow requests from
> beyond the FLUSH point to proceed until all requests prior to the
> barrier are complete [2]. However, the problem multi-disk arrays face
> is that disks FLUSH independently. [3: important sidebar if you run
> UFS]. A FLUSH on disk X says nothing about the state of disk Y and
> says nothing about selecting disk Y after power cycling.

NAND bad block tables are versioned kinda like what you describe in the
following: each copy has a number, which is increased on update. The
increase only happens after a successful update. On startup, you look
for the highest version. A CRC could be useful too: if the CRC doesn't
match, it's a dirty block.
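Roughly, the startup selection could look like this (just a sketch: the
struct layout and the crc32() helper are made up for illustration, not
the real NAND bad block table format):

#include <stdint.h>
#include <stddef.h>

struct table_copy {
        uint32_t version;       /* bumped only after a successful update */
        uint32_t crc;           /* CRC over data[] */
        uint8_t  data[512];
};

/* Assumed helper; any CRC over the payload works for the example. */
uint32_t crc32(const void *buf, size_t len);

/*
 * Return the index of the newest copy whose CRC checks out, or -1 if
 * no copy is valid.
 */
int
pick_newest_copy(const struct table_copy *copies, int ncopies)
{
        int i, best = -1;
        uint32_t best_version = 0;

        for (i = 0; i < ncopies; i++) {
                if (crc32(copies[i].data, sizeof(copies[i].data)) !=
                    copies[i].crc)
                        continue;       /* torn/dirty copy, skip it */
                if (best < 0 || copies[i].version > best_version) {
                        best = i;
                        best_version = copies[i].version;
                }
        }
        return (best);
}

The same "highest valid version wins" rule is basically what a mirror
header counter would rely on after a power cycle.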
> ---
>
> The dmirror design I was working on solved the problem through
> overwhelming force -- adding a physical journal and a header sector to
> each device. Each device would log all of the blocks it was going to
> write to the journal. It would then complete a FLUSH request to ensure
> the blocks had hit disk. Only then would we update the blocks we'd
> meant to. After we updated the target blocks, we would issue another
> FLUSH command. Then we'd update a counter in a special header sector.
> [assumption: writes to single sectors on disk are atomic and survive
> DragonFly developers removing power]. Each journal entry would contain
> (the value of the counter)+1 before the operations were complete. To
> know if a journal entry was correctly written, each entry would also
> include a checksum of the update it was going to carry out.
>
> The recovery path would use the header's counter field to determine
> which disk was most current. It would then replay the necessary
> journal entries (entries with a counter > the header->counter) to
> bring that device into sync (perhaps it would only replay these into
> memory into overlay blocks, I'd not decided) and then sync that disk
> onto all of the others.
>
> Concretely, from dmirror_strategy:
>
> /*
>  * dmirror_strategy()
>  *
>  * Initiate I/O on a dmirror VNODE.
>  *
>  * READ: disk_issue_read -> disk_read_bio_done -> (disk_issue_read)
>  *
>  * The read-path uses push_bio to get a new BIO structure linked to
>  * the BUF and ties the new BIO to the disk and mirror it is issued
>  * on behalf of. The callback is set to disk_read_bio_done.
>  * In disk_read_bio_done, if the request succeeded, biodone() is called;
>  * if the request failed, the BIO is reinitialized with a new disk
>  * in the mirror and reissued till we get a success or run out of disks.
>  *
>  * WRITE: disk_issue_write -> disk_write_bio_done(..) -> disk_write_tx_done
>  *
>  * The write path allocates a write group and transaction structures for
>  * each backing disc. It then sets up each transaction and issues them
>  * to the backing devices. When all of the devices have reported in,
>  * disk_write_tx_done finalizes the original BIO and deallocates the
>  * write group.
>  */
>
> A write group was the term for all of the state associated with a
> single write to all of the devices. A write transaction was the term
> for all of the state associated with a single write cycle to one disk.
>
> Concretely, for write groups and write transactions:
>
> enum dmirror_write_tx_state {
>         DMIRROR_START,
>         DMIRROR_JOURNAL_WRITE,
>         DMIRROR_JOURNAL_FLUSH,
>         DMIRROR_DATA_WRITE,
>         DMIRROR_DATA_FLUSH,
>         DMIRROR_SUPER_WRITE,
>         DMIRROR_SUPER_FLUSH,
> };
>
> A write transaction was guided through a series of states by issuing
> I/O via vn_strategy() and transitioning on biodone() calls. At the
> DMIRROR_START state, it was not yet issued to the disk, just freshly
> allocated. Journal writes were issued and the tx entered the
> DMIRROR_JOURNAL_WRITE state. When the journal writes completed, we
> entered the JOURNAL_FLUSH state and issued a FLUSH bio. When the flush
> completed, we entered the DATA_WRITE state; next the DATA_FLUSH state,
> then the SUPER_WRITE and then the SUPER_FLUSH state. When the
> superblock flushed, we walked to our parent write group and marked
> this disk as successfully completing all of the necessary steps. When
> all of the disks had reported, we finished the write group and finally
> called biodone() on the original bio.
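Just to make that state walk concrete, the biodone()-driven transition
could look something like this (a sketch only: the state names are from
the mail, but the issue_*() helpers and write_group_disk_done() are
placeholders, and error handling is omitted):

/*
 * Called once to kick a per-disk write tx off (state DMIRROR_START)
 * and then again from each biodone() callback to take the next step.
 */
static void
dmirror_write_tx_advance(struct dmirror_write_tx *tx)
{
        switch (tx->state) {
        case DMIRROR_START:
                tx->state = DMIRROR_JOURNAL_WRITE;
                issue_journal_write(tx);        /* biodone -> advance */
                break;
        case DMIRROR_JOURNAL_WRITE:
                tx->state = DMIRROR_JOURNAL_FLUSH;
                issue_flush(tx);                /* journal must hit media */
                break;
        case DMIRROR_JOURNAL_FLUSH:
                tx->state = DMIRROR_DATA_WRITE;
                issue_data_write(tx);           /* now touch the targets */
                break;
        case DMIRROR_DATA_WRITE:
                tx->state = DMIRROR_DATA_FLUSH;
                issue_flush(tx);
                break;
        case DMIRROR_DATA_FLUSH:
                tx->state = DMIRROR_SUPER_WRITE;
                issue_super_write(tx);          /* bump the header counter */
                break;
        case DMIRROR_SUPER_WRITE:
                tx->state = DMIRROR_SUPER_FLUSH;
                issue_flush(tx);
                break;
        case DMIRROR_SUPER_FLUSH:
                /* All steps done on this disk; tell the write group. */
                write_group_disk_done(tx->write_group, tx);
                break;
        }
}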
> struct dmirror_write_tx {
>         struct dmirror_write_group *write_group;
>         struct bio bio;
>         enum dmirror_write_tx_state state;
> };
>
> The write_tx_done path was the biodone call for a single write
> request. The embedded bio was initialized via initbiobuf().
>
> enum dmirror_wg_state {
>         DMIRROR_WRITE_OK,
>         DMIRROR_WRITE_FAIL
> };
>
> struct dmirror_write_group {
>         struct lock lk;
>         struct bio *obio;
>         struct dmirror_dev *dmcfg;      /* Parent dmirror */
>         struct kref ref;
>         /* some kind of per-mirror linkages */
>         /* some kind of per-disk linkages */
> };
>
> The write group tracked the state of a write to all of the devices;
> the embedded lockmgr lock prevented concurrent write_tx_done()s from
> operating. The bio ptr was to the original write request. The ref
> (kref no longer exists, so this would be a counter now) was the number
> of outstanding devices. The per-mirror and per-disk linkages allowed a
> fault on any I/O operation to a disk in the mirror to prevent any
> future I/O from being issued to that disk; the code on fault would
> walk all of the requests and act as though that particular write TX
> finished with a B_ERROR buffer.
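The "all disks reported in" step could then be a plain reference-count
drop, roughly like this (again a sketch: the nref and state fields and
the M_DMIRROR malloc type are made up, since the struct above only has
the kref placeholder; only the idea of finalizing obio with biodone()
follows the description):

static void
dmirror_write_group_rele(struct dmirror_write_group *wg, int error)
{
        struct bio *obio;
        int last;

        lockmgr(&wg->lk, LK_EXCLUSIVE);
        if (error)
                wg->state = DMIRROR_WRITE_FAIL; /* hypothetical field */
        last = (--wg->nref == 0);               /* hypothetical counter */
        obio = wg->obio;
        lockmgr(&wg->lk, LK_RELEASE);

        if (!last)
                return;

        /* Every backing disk has reported in; finish the original BIO. */
        if (wg->state == DMIRROR_WRITE_FAIL) {
                obio->bio_buf->b_flags |= B_ERROR;
                obio->bio_buf->b_error = EIO;
        }
        biodone(obio);
        kfree(wg, M_DMIRROR);   /* M_DMIRROR: hypothetical malloc type */
}

Each disk's write tx (and each faulted disk) would call this exactly
once, so the last caller is the one that finalizes and frees.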
> The disk read path was simpler -- a single disk in the mirror was
> selected and vn_strategy() called. The biodone callback checked if
> there was a read error; if so, we faulted the disk and continued
> selecting mirrors to issue to until we found one that worked. Each
> faulted disk had its outstanding I/Os killed.
>
> I had not given thought as to what to do when a mirror was running in
> a degraded configuration or with an unsynced disk trying to catch up;
> the latter requires care in that the unsynced disk can serve reads but
> not writes. Also, what to do to live-remove a disk. Or how to track
> all of the disks in a mirror. (It'd be nice to have each disk know all
> the other mirror components via UUID or something and to record the
> last counter val it knew about for the other disk. This will prevent
> disasters where each disk in a mirror is run independently in a
> degraded setup and then brought back together.)
>
> AFAIK, no RAID1 is this paranoid (sample set: Linux MD, Gmirror, ccd).
> And it is a terrible design from a performance perspective -- 3 FLUSH
> BIOs for every set of block writes. But it does give you a hope of
> correctly recovering your RAID1 in the event of a power cycle, crash,
> or disk failure...
>
> Please tell me if this sounds crazy, overkill, or is just wrong! Or if
> you want to work on this or would like to work on a classic bitmap +
> straight-mirror RAID1.

IMHO this is overkill, as you want to guarantee things that are not the
function of this layer of abstraction. :-)

> -- vs
>
> [1]: A block bitmap of touched blocks requires care because you must
> be sure that before any block is touched, the bitmap has that block
> marked. Sure in the sense that the bitmap block update has hit media.
>
> [2]: I've never seen exactly what you can assume about BUF_CMD_FLUSH
> (or BIO_FLUSH, as it might be known in other BSDs)... this is a strong
> set of assumptions, I'd love to hear if I'm wrong.
>
> [3]: UFS in DragonFly and in FreeBSD does not issue any FLUSH
> requests. I have no idea how this can be correct... I'm pretty sure it
> is not.

--
NetBSD - Simplicity is prerequisite for reliability