> Simple is good.  Even for deduplication alone, I think data integrity is
> critical - otherwise we risk stale dedup metadata pointing to clusters
> that are unallocated or do not contain the right data.  So the journal
> will probably need to follow techniques for commits/checksums.

I agree that checksums are missing from the dedup metadata.
Maybe we could even use some kind of error-correcting code instead of a
checksum.
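
For illustration only, each journal entry could carry a CRC that is verified
before the entry is replayed. The sketch below uses zlib's crc32() and an
invented entry layout; none of these names come from the actual dedup code.

#include <stdint.h>
#include <zlib.h>

typedef struct JournalEntry {
    uint8_t  payload[24];   /* e.g. a hash insertion/deletion record */
    uint32_t crc;           /* checksum of payload, stored on disk */
} JournalEntry;

static uint32_t entry_crc(const JournalEntry *e)
{
    return (uint32_t)crc32(0L, e->payload, sizeof(e->payload));
}

/* Return 1 if the entry is intact and safe to replay, 0 otherwise. */
static int entry_valid(const JournalEntry *e)
{
    return entry_crc(e) == e->crc;
}

An error-correcting code would go further and let us repair a damaged entry
instead of discarding it, at the cost of a few more bytes per entry.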

Concerning data integrity, the events that the deduplication code cannot lose
are hash deletions, because they mark a previously inserted hash as obsolete.

The problem with a commit/flush mechanism on hash deletion is that it would slow
down hash store insertions and also cause some extra SSD wear.

To solve this I relied on the fact that the dedup metadata as a whole is
disposable.

So I implemented a "dedup dirty" bit.

When QEMU stops, the journal is flushed and the dirty bit is cleared.
When QEMU starts and the dirty bit is set, a crash is detected and _all_ the
deduplication metadata is dropped.
QCOW2 data integrity won't suffer; only the dedup ratio will be lower.
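
To make the ordering concrete, here is a minimal C sketch of that open/close
logic. All the names (DEDUP_DIRTY, dedup_flush_journal(), ...) are placeholders
for the example, not the actual qcow2 functions.

#include <stdint.h>

#define DEDUP_DIRTY (1u << 0)

typedef struct DedupState {
    uint32_t flags;   /* persisted alongside the image header */
} DedupState;

/* Placeholders for the real on-disk operations. */
static void dedup_sync_header(DedupState *s)       { (void)s; /* write flags to disk */ }
static void dedup_flush_journal(DedupState *s)     { (void)s; /* commit pending hash deletions */ }
static void dedup_drop_all_metadata(DedupState *s) { (void)s; /* discard hash store and journal */ }

/* On open: a set dirty bit means the last run crashed. */
static void dedup_open(DedupState *s)
{
    if (s->flags & DEDUP_DIRTY) {
        dedup_drop_all_metadata(s);   /* metadata is disposable */
    }
    s->flags |= DEDUP_DIRTY;          /* mark the metadata as in use */
    dedup_sync_header(s);
}

/* On clean shutdown: flush the journal first, clear the bit last. */
static void dedup_close(DedupState *s)
{
    dedup_flush_journal(s);
    s->flags &= ~DEDUP_DIRTY;
    dedup_sync_header(s);
}

The important detail is the ordering: the dirty bit is cleared only after the
journal has reached the disk, so a crash in between still leaves the bit set.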

As you said once on IRC, crashes don't happen often.

Benoît

