On Wed, 2008-12-10 at 17:53 +0000, Oliver Mattos wrote:
> > > 2)  Keep a tree of checksums for data blocks, so that a bit of data can
> > > be located by its checksum.  Whenever a data block is about to be
> > > written check if the block matches any known block, and if it does then
> > > don't bother duplicating the data on disk.  I suspect this option may
> > > not be realistic for performance reasons.
> > > 
> > 
> > When compression was added, the writeback path was changed to make
> > option #2 viable, at least in the case where the admin is willing to
> > risk hash collisions on strong hashes.  When a direct read
> > comparison is required before sharing blocks, it is probably best done
> > by a stand-alone utility, since we don't want to wait for a read of a
> > full extent every time we want to write one.
> > 
> 
> Can we assume hash collisions won't occur?  I mean if it's a 256-bit
> hash, then even with 256TB of data and one hash per block, the chances
> of collision are still too small to calculate on gnome calculator.

It depends on the use case.  We can assume that if someone really wants
collisions to occur, they will eventually be able to trigger them.  So,
the code should have the option to verify the bytes are identical via a
read.
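
To make the shape of that write-path decision concrete, here is a rough
sketch, not btrfs code: lookup_extent_by_hash(), read_extent(),
share_extent() and write_new_block() are hypothetical helpers standing in
for whatever index and extent code the filesystem keeps, and SHA256()
comes from OpenSSL's <openssl/sha.h>.

/* Sketch only: dedup one block on write, optionally verifying bytes. */
#include <openssl/sha.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct extent;                                          /* opaque on-disk extent */
struct extent *lookup_extent_by_hash(const unsigned char *hash); /* hypothetical */
int read_extent(struct extent *e, unsigned char *buf);           /* hypothetical */
void share_extent(struct extent *e);                             /* hypothetical */
void write_new_block(const unsigned char *block);                /* hypothetical */

void dedup_write(const unsigned char *block, bool verify_bytes)
{
	unsigned char hash[SHA256_DIGEST_LENGTH];
	unsigned char existing[BLOCK_SIZE];
	struct extent *match;

	SHA256(block, BLOCK_SIZE, hash);
	match = lookup_extent_by_hash(hash);

	if (match) {
		if (!verify_bytes) {
			/* Trust the strong hash: share without re-reading. */
			share_extent(match);
			return;
		}
		/* Paranoid mode: read the candidate back and compare bytes,
		 * accepting an extra read on every deduped write. */
		if (read_extent(match, existing) == 0 &&
		    memcmp(existing, block, BLOCK_SIZE) == 0) {
			share_extent(match);
			return;
		}
	}
	write_new_block(block);
}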

For a backup farm or virtualization dataset, I personally wouldn't use
the read-verify stage.  But others will.
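
For scale, the quoted 256TB figure works out like this.  This is a rough
birthday-bound estimate; the 4 KiB block size and the n^2 / 2^(bits+1)
approximation are my assumptions, not anything fixed by btrfs:

/* Back-of-envelope collision probability for a 256-bit dedup hash. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double data_bytes = 256.0 * 1024 * 1024 * 1024 * 1024; /* 256 TiB */
	double block_size = 4096.0;                /* assumed 4 KiB blocks */
	double hash_bits  = 256.0;

	double n = data_bytes / block_size;        /* ~2^36 hashed blocks */
	/* birthday bound: log2(p) ~= 2*log2(n) - (hash_bits + 1) */
	double log2_p = 2.0 * log2(n) - (hash_bits + 1.0);

	printf("blocks hashed:          2^%.1f\n", log2(n));
	printf("collision probability: ~2^%.1f\n", log2_p);
	return 0;
}

That comes out around 2^-185, which is why skipping the read-verify stage
on a strong hash is a defensible trade-off when nobody is deliberately
constructing collisions.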

-chris

