On 9/10/07, Pawel Jakub Dawidek <[EMAIL PROTECTED]> wrote:
> The problem with RAID5 is that different blocks share the same parity,
> which is not the case for RAIDZ. When you write a block in RAIDZ, you
> write the data and the parity, and then you switch the pointer in the
> uberblock. For RAID5, you write the data and you need to update the
> parity, which also protects some other data. Now if you write the data,
> but don't update the parity before a crash, you have a hole. If you
> update your parity before the write and then crash, the parity is
> inconsistent with other blocks in the same stripe.
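To make the write hole concrete, here is a minimal sketch (plain Python, all names hypothetical, nothing to do with the actual RAID5 or RAIDZ code paths): a three-disk stripe where the parity block is the XOR of the two data blocks. Overwriting one data block in place and crashing before the parity update leaves stale parity, so reconstructing the *other* block after a disk failure silently returns garbage.

```python
# Illustrative sketch only: parity = XOR of the data blocks in a stripe.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0 = bytes([0xAA] * 4)     # data block on disk 0
d1 = bytes([0x55] * 4)     # data block on disk 1
parity = xor(d0, d1)       # parity block on disk 2

# Overwrite d0 in place, then "crash" before the parity is rewritten.
d0 = bytes([0xFF] * 4)     # new data hits disk 0
# ...parity update never happens: this is the write hole.

# Later disk 1 fails; d1 is rebuilt from d0 and the stale parity:
reconstructed_d1 = xor(d0, parity)
# reconstructed_d1 != the original d1 -- a block that was never written
# to has been corrupted by an unrelated, interrupted write.
```

Note the corrupted block (d1) is not the one being written; that is exactly why the old parity must be treated as "live" for the whole stripe.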
This is why you should consider the "old" data and parity as being "live".
The old data (being overwritten) is live because the parity depends on it
to stay consistent, and the old parity is live because it protects the
other blocks in the stripe.

What IMO should be done is object-level RAID: write the new parity and
new data into blocks not yet in use, and since the new parity also
protects the "neighbouring" data, the old parity can then be freed; once
the old parity is no longer live, the "overwritten" data block can be
freed as well. Note that this is very different from traditional RAID5,
as it requires intimate knowledge of the FS structure. Traditional RAID
also keeps parity "in line" with the data blocks it protects, but that
is not necessary if the FS can record where the parity is located.

Define "live data" well enough and you're safe if you never overwrite
any of it.

> My idea was to have one sector every 1GB on each disk for a "journal"
> to keep a list of blocks being updated.

This would be called a "write intent log" or "bitmap" (as in Linux
software RAID). It speeds up recovery, but doesn't protect against
write hole problems.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss