> From: Dan Swartzendruber [mailto:[email protected]]
>
> Not to nitpick, but dedup isn't really compression in one significant
> respect. e.g. you can have 3 copies of the same data chunk and it is only
> stored as one (effectively a compression ratio of 4:1), even if the data in
> question is uncompressible (due to already being compressed.)
Try this:

for f in a b c d ; do dd if=/dev/urandom of=$f bs=1k count=1 ; done

Now you have four files containing random data (uncompressible).

for ((i=0; i<100; i++)) ; do cat a b c d >> final ; done

Now you've taken uncompressible data and repeated it a bunch of times. The result is compressible.

gzip final

I can't say for sure whether gzip will actually handle this compression, because I didn't bother running those commands on my system. It all depends on whether my 1k blocksize is larger or smaller than the scope of gzip's compression tables. But I can say that in principle it's compressible. Some algorithms (LZW, for example) use a lookup table: if repeated uncompressible patterns are detected, the whole block of uncompressible data gets stored in the table, and only a table index needs to be written to the compressed data stream.

Lookup table... repeated data... just store the data once (or a small number of times) and reference it multiple times... Sound familiar? Like what the DDT does?

If you look at how DEFLATE works (it's the algorithm inside zlib and gzip, among others), one of its techniques is duplicate string elimination. De-duplication, one might say. And so on.
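To put rough numbers on the experiment without scratch files, here's a quick sketch of the same idea using Python's zlib module (zlib implements DEFLATE, the same compression gzip uses). I haven't run this either; the sizes in the comments are ballpark expectations, not measured results.

import os
import zlib

block = os.urandom(1024)            # 1 KiB of random, effectively uncompressible data
repeated = block * 400              # the same block repeated 400 times (400 KiB total)
fresh = os.urandom(len(repeated))   # 400 KiB of non-repeating random data, for comparison

print(len(zlib.compress(block, 9)))     # about 1 KiB: a single random block doesn't shrink
print(len(zlib.compress(repeated, 9)))  # a few KiB: the repeats collapse into back-references
print(len(zlib.compress(fresh, 9)))     # about 400 KiB: no repeats, nothing to eliminate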

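And to make the lookup-table point concrete, here's a toy LZW-style encoder. It's nothing like a production implementation (the function and names are just for illustration), but it shows the mechanism: a repeated string lands in the table once, and the output stream carries only table indices.

def lzw_encode(data):
    # The table starts with every single byte; longer repeated strings are added as they're seen.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    out = []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                   # keep growing the current match
        else:
            out.append(table[w])     # emit the index of the longest string already in the table
            table[wc] = next_code    # remember the new, longer string for next time
            next_code += 1
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

data = b"abcd" * 1000                      # 4000 bytes of repeating data
print(len(data), len(lzw_encode(data)))    # far fewer output codes than input bytes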