On 05/04/2009 10:39 AM, Tomasz Chmielewski wrote:
> Ric Wheeler wrote:
>> One thing in the above scheme that would be really interesting for
>> all possible hash functions is maintaining good stats on hash
>> collisions, effectiveness of the hash, etc. There has been a lot of
>> press about MD5 hash collisions, for example - it would be really neat
>> to be able to track real-world data on those.
>
> See here ("The hashing function"):
> http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues
>
> It's not "real world data", but it gives some overview which applies
> here.
I have lots of first-hand "real world" data from 5 years at EMC in the
Centera group, where we used various types of hashes to do single
instancing at the file level, but those are unfortunately not public :-)
Note that other parts of EMC do dedup at the block level (Avamar most
notably).
The key to doing dedup at scale is answering a bunch of questions:
(1) Block level or file level dedup?
(2) Inband dedup (during a write) or background dedup?
(3) How reliably can you protect the pool of blocks? How reliably
can you protect the database that maps hashes to blocks?
(4) Can you give users who are somewhat jaded confidence in your
solution? (This is where stats come in very handy!)
(5) In the end, dedup is basically a data compression trick - how
effective is it (including the costs of the metadata, etc) compared to
less complex schemes? What does it cost in CPU, DRAM, and impact on the
foreground workload?
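To make questions (1), (2), and (4) concrete, here is a toy sketch of inband, block-level dedup - not how Centera, Avamar, or btrfs actually do it, just an illustration of the mechanics. It keeps a hash-to-block map, verifies the actual bytes on a hash match rather than trusting the hash alone, and tracks the stats (dup hits, collisions, dedup ratio) that help build user confidence:

```python
import hashlib

class BlockDedupStore:
    """Toy inband, block-level dedup: each unique block is stored once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # SHA-256 digest -> block bytes
        self.stats = {"writes": 0, "dup_hits": 0, "collisions": 0}

    def write_block(self, data):
        self.stats["writes"] += 1
        digest = hashlib.sha256(data).digest()
        existing = self.blocks.get(digest)
        if existing is not None:
            # Compare the actual bytes instead of trusting compare-by-hash;
            # this is also how you gather real-world collision stats.
            if existing == data:
                self.stats["dup_hits"] += 1
            else:
                self.stats["collisions"] += 1  # astronomically unlikely for SHA-256
            return digest
        self.blocks[digest] = data
        return digest

    def dedup_ratio(self):
        """Logical writes per physically stored block."""
        stored = len(self.blocks)
        return self.stats["writes"] / stored if stored else 1.0
```

For example, writing the same 4 KiB block twice plus one distinct block gives 3 logical writes against 2 stored blocks, a 1.5x dedup ratio. The byte-for-byte verify on every hit is exactly the kind of cost question (5) asks about: it trades an extra read/compare for protection against collisions.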
Dedup is a very active area in the storage world, but you need to be
extremely robust in the face of failure since a single block failure
could (worst case!) lose all stored data in the pool :-)
Regards,
Ric
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html