I vote +1 for low dedupe ratios being due to precompressed data:

* Even MS Office files are actually ZIP archives now (see the sketch after this list).
* Windows keeps gigabytes of installers, which are mostly precompressed cabinets.
* Many application data dumps are precompressed.
* Practically all media files are precompressed.
* Many file servers hold a large amount of the above data types, plus tgz/zip/rar/7z archives kept as snapshots.
* Many TSM environments enable client-side compression, and a few enable client-side encryption.
* TSM already does basic deduplication by way of its incremental backup strategy at the file level.
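To make the first point concrete, here is a minimal Python sketch; the filename `report.docx` is just a placeholder for any Office document you have on hand.

```python
import zipfile

# An Office Open XML file (.docx/.xlsx/.pptx) is a ZIP archive of
# already-deflated XML parts, so a block-level dedupe engine sees
# compressed bytes, not the underlying text.
with zipfile.ZipFile("report.docx") as z:          # hypothetical path
    for info in z.infolist()[:5]:
        print(info.filename, info.compress_type)   # 8 == ZIP_DEFLATED
```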
If all of your OS images are clones of a golden image, then dedupe helps a little, even with noncompressible data. Using gzip's `--rsyncable` option, or other content-aware or dedupe-friendly compression options, can sometimes help a little. If you are using TSM client-side compression (for bandwidth reasons), then TSM client-side dedupe can see through it.

As the others have already stated, the best option is to separate out your non-compressible data. Dedupe is just compression with a very large dictionary, and recompressing already-compressed data rarely works well.

Even deduplicating multiple versions of a document is tough with compressed XML formats: you change the file, recompress it, the dictionary changes, and because of that the end payload is vastly different. For example, make a Word .docx of a couple of megabytes, modify it in several places, and re-save it. Then try to zip the two files together; you won't get a 45% savings.
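You can reproduce that effect without Word. Below is a minimal sketch using Python's zlib as a stand-in for a dedupe engine's dictionary; the generated "document" and the edit pattern are invented for illustration, and the data is kept small enough to fit inside DEFLATE's 32 KB window so the shared bytes are actually visible to it.

```python
import random
import zlib

# Build a fake ~19 KB "document" and a lightly edited second version.
random.seed(0)
words = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]
base = " ".join(random.choice(words) for _ in range(3000)).encode()
edited = bytearray(base)
for i in range(100, len(edited), 4000):   # a handful of 5-byte edits
    edited[i:i + 5] = b"EDIT!"
edited = bytes(edited)

# Compress each version separately, as an application would on save.
a, b = zlib.compress(base), zlib.compress(edited)

# Stand-in for dedupe: compress the pair together, so shared bytes can
# be expressed as back-references into the shared "dictionary".
pair_plain = len(zlib.compress(base + edited))
pair_compressed = len(zlib.compress(a + b))

print("one version, compressed:    ", len(a))
print("plaintext pair, deduped:    ", pair_plain)       # ~ one copy
print("compressed pair, 'deduped': ", pair_compressed)  # ~ two copies
```

The plaintext pair shrinks to roughly the size of one copy, because the second version is almost entirely back-references into the first. The pair of compressed streams does not: the streams diverge right after the first edit, so the "dictionary" finds almost nothing to share.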