On 9/10/07, Pawel Jakub Dawidek <[EMAIL PROTECTED]> wrote:
> The problem with RAID5 is that different blocks share the same parity,
> which is not the case for RAIDZ. When you write a block in RAIDZ, you
> write the data and the parity, and then you switch the pointer in the
> uberblock. For RAID5, you write the data and you need to update the
> parity, which also protects some other data. Now if you write the data
> but don't update the parity before a crash, you have a hole. If you
> update the parity before the write and then crash, the parity is
> inconsistent with the other blocks in the same stripe.

This is why you should consider the "old" data and parity as still
being "live". The old data (being overwritten) is live because it is
needed for the parity to be consistent, and the old parity is live
because it protects the other blocks in the stripe.
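To make the window concrete, here's a toy sketch (Python, with the
disk layout reduced to nested lists - purely illustrative, not any
real RAID code) of the classic RAID5 read-modify-write:

    # Toy RAID5 read-modify-write. "Disks" are lists of equal-sized
    # blocks; disks[parity_col][stripe] holds the XOR of the others.

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def raid5_update(disks, stripe, col, parity_col, new_data):
        old_data = disks[col][stripe]
        old_parity = disks[parity_col][stripe]
        # The *old* data is still needed here - it is live until the
        # new parity is on disk.
        new_parity = xor(xor(old_parity, old_data), new_data)
        disks[col][stripe] = new_data
        # A crash right here is the write hole: parity no longer
        # matches the stripe, so any *other* column reconstructed
        # from it comes back corrupted.
        disks[parity_col][stripe] = new_parity

    # Three "disks", one stripe, parity in column 2:
    d0, d1 = b"\x01", b"\x02"
    disks = [[d0], [d1], [xor(d0, d1)]]
    raid5_update(disks, stripe=0, col=0, parity_col=2, new_data=b"\x07")
    assert disks[2][0] == xor(disks[0][0], disks[1][0])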

What IMO should be done is object-level RAID: write the new parity and
the new data into blocks not yet in use, and since the new parity also
protects the "neighbouring" data, the old parity can be freed; once it
is no longer live, the "overwritten" data block can be freed as well.
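Roughly like this (another toy Python sketch; the allocator and block
map are made-up stand-ins for FS machinery, the point is only the
ordering of writes and frees):

    # Copy-on-write parity update in the spirit of the proposal above.

    def xor_all(blocks):
        out = bytes(len(blocks[0]))          # all zeroes
        for b in blocks:
            out = bytes(x ^ y for x, y in zip(out, b))
        return out

    class ToyObjectRaid:
        def __init__(self, data_blocks):
            self.store = {}                  # location -> block
            self.next_loc = 0
            self.data_locs = [self._put(b) for b in data_blocks]
            self.parity_loc = self._put(xor_all(data_blocks))

        def _put(self, block):
            loc, self.next_loc = self.next_loc, self.next_loc + 1
            self.store[loc] = block
            return loc

        def overwrite(self, idx, new_data):
            blocks = [self.store[l] for l in self.data_locs]
            blocks[idx] = new_data
            # 1. New data and new parity go to never-used locations;
            #    a crash up to here leaves the old stripe fully intact.
            new_data_loc = self._put(new_data)
            new_parity_loc = self._put(xor_all(blocks))
            # 2. Switch the metadata pointers (the uberblock flip, in
            #    ZFS terms).
            old_data_loc = self.data_locs[idx]
            old_parity_loc = self.parity_loc
            self.data_locs[idx] = new_data_loc
            self.parity_loc = new_parity_loc
            # 3. The old parity is dead (the new one protects the
            #    neighbours), and only then is the old data dead too.
            del self.store[old_parity_loc]
            del self.store[old_data_loc]

    raid = ToyObjectRaid([b"\x01", b"\x02"])
    raid.overwrite(0, b"\x07")
    assert xor_all([raid.store[l] for l in raid.data_locs]) \
        == raid.store[raid.parity_loc]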

Note that this is very different from traditional RAID5, as it requires
intimate knowledge of the FS structure. Traditional RAIDs also keep the
parity "in line" with the data blocks it protects, but that is not
necessary if the FS can store information about where the parity is
located.

Define "live data" well enough and you're safe if you never overwrite any of it.

> My idea was to have one sector every 1GB on each disk for a "journal"
> to keep a list of blocks being updated.

This would be called a "write intent log" or "bitmap" (as in Linux
software RAID). It speeds up recovery, but doesn't protect against the
write hole problem.
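For comparison, a bitmap of this kind is just "set a bit before
touching a region, clear it when both writes are done" - something
like this toy version (granularity and API made up, not the actual md
interface):

    # Write-intent bitmap: one dirty bit per region of the array.

    class IntentBitmap:
        def __init__(self, regions):
            self.dirty = [False] * regions

        def before_write(self, region):
            self.dirty[region] = True    # must reach stable storage
                                         # before data/parity writes

        def after_write(self, region):
            self.dirty[region] = False

        def regions_to_resync(self):
            # After a crash only the dirty regions need a parity
            # rebuild, so recovery is fast. But inside a dirty region
            # the data may be half-old, half-new: the write hole is
            # still there, the bitmap only tells you where to look.
            return [i for i, d in enumerate(self.dirty) if d]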