I vote +1 for low dedupe ratios being due to precompressed data:

* Even MS Office files are actually ZIP archives now (see the sketch after this list).
* Windows keeps gigabytes of installers, which are mostly precompressed cabinets.
* Many application data dumps are precompressed.
* Practically all media files are precompressed.
* Many file servers hold a large amount of the above data types, plus tgz/zip/rar/7z archives kept as snapshots.
* Many TSM environments enable client-side compression, and a few enable client-side encryption.
* TSM already does basic deduplication by way of its incremental backup strategy at the file level.
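To make the first point concrete, here is a minimal Python sketch; the filename `report.docx` is just a placeholder for any Office document you have on hand.

```python
import zipfile

# An Office Open XML file (.docx/.xlsx/.pptx) is a ZIP archive of
# already-deflated XML parts, so a block-level dedupe engine sees
# compressed bytes, not the underlying text.
with zipfile.ZipFile("report.docx") as z:          # hypothetical path
    for info in z.infolist()[:5]:
        print(info.filename, info.compress_type)   # 8 == ZIP_DEFLATED
```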
If all of your OS images are clones of a golden image, then dedupe helps a little, even with noncompressible data. Using gzip's `--rsyncable` option, or other content-aware or dedupe-friendly compression options, can sometimes help a little. If you are using TSM client-side compression (for bandwidth reasons), then TSM client-side dedupe can see through it.

As the others have already stated, the best option is to separate out your non-compressible data. Dedupe is just compression with a very large dictionary, and recompressing already-compressed data rarely works well.

Even deduplicating multiple versions of a document is tough with compressed XML formats: you change the file, recompress it, the dictionary changes, and because of that the end payload is vastly different. For example, make a Word .docx of a couple of megabytes, modify it in several places, and re-save it. Then try to zip the two files together; you won't get a 45% savings.
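You can reproduce that effect without Word. Below is a minimal sketch using Python's zlib as a stand-in for a dedupe engine's dictionary; the generated "document" and the edit pattern are invented for illustration, and the data is kept small enough to fit inside DEFLATE's 32 KB window so the shared bytes are actually visible to it.

```python
import random
import zlib

# Build a fake ~19 KB "document" and a lightly edited second version.
random.seed(0)
words = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]
base = " ".join(random.choice(words) for _ in range(3000)).encode()
edited = bytearray(base)
for i in range(100, len(edited), 4000):   # a handful of 5-byte edits
    edited[i:i + 5] = b"EDIT!"
edited = bytes(edited)

# Compress each version separately, as an application would on save.
a, b = zlib.compress(base), zlib.compress(edited)

# Stand-in for dedupe: compress the pair together, so shared bytes can
# be expressed as back-references into the shared "dictionary".
pair_plain = len(zlib.compress(base + edited))
pair_compressed = len(zlib.compress(a + b))

print("one version, compressed:    ", len(a))
print("plaintext pair, deduped:    ", pair_plain)       # ~ one copy
print("compressed pair, 'deduped': ", pair_compressed)  # ~ two copies
```

The plaintext pair shrinks to roughly the size of one copy, because the second version is almost entirely back-references into the first. The pair of compressed streams does not: the streams diverge right after the first edit, so the "dictionary" finds almost nothing to share.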