>>>>> " " == Otto Wyss <[EMAIL PROTECTED]> writes:
>> > gzip --compress-like=old-foo foo
>>
>> AFAIK that's NOT possible with gzip. Same with bzip2.
>>
> Why not?

gzip builds a dictionary (which gets really large) of strings that occur in the input and encodes references to them. At the start the dictionary is empty, so the first character is stored essentially unencoded and inserted into the dictionary. The next character is encoded using the first one, and so on; longer and longer strings enter the dictionary. Every sequence of bytes thus produces a unique (well, not strictly unique, but nearly so) dictionary that can be completely reconstructed from the compressed data: given the dictionary after the first n characters, the (n+1)th character can be decoded and the next dictionary state computed.

I think it is pretty much impossible to find two files that result in the same dictionary, and it is certainly impossible at the speed we need. You cannot encode two arbitrary files with the same dictionary without adding the dictionary to the file, which gzip does not do (since that would be a waste). So, as you see, that method is not possible.

But there is a little optimisation that can be used (and is used by the --rsyncable patch): when the dictionary gets too big, the compression ratio drops and it becomes ineffective, so gzip flushes the dictionary and starts over with an empty one. The --rsyncable patch changes the moments at which that happens: it looks for blocks of bytes with a certain rolling checksum, and whenever the checksum matches it flushes the dictionary. Two similar files will therefore most likely flush the dictionary at exactly the same places, and wherever the files are equal after such a flush, the data is encoded identically and rsync can match those blocks. The author claims rsyncable gzip files take about 0.1-0.2% more space, a loss I think everybody is willing to pay.

>> I wish it were that simple.
>>
> I'm not saying it's simple, I'm saying it's possible.
> I'm not a compression specialist, but in theory there is nothing
> which prevents this except the actual implementation. Maybe it's
> time to design a compression algorithm which has this
> functionality (same difference rate as the source) from the
> ground up.

There are such algorithms, but they either always use the same dictionary or table (like some i386 .exe runtime compressors that are specialised to the patterns used in opcodes), or they waste space by adding the dictionary/table to the compressed file. That is a huge waste with all the small diff files we have.

The --rsyncable patch looks promising for a start and will greatly reduce the downloads for source mirrors, if it is used.

Regards,
Goswin
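PS: Here is a rough Python sketch of the rolling-checksum trick described above. This is NOT the patch's actual code; the window size and mask are invented for illustration. The point is that the flush trigger depends only on the last few bytes, so an edit stops influencing the flush points once it slides out of the window and two similar files resynchronise.

```python
import random

WINDOW = 256   # bytes covered by the rolling sum (invented value)
MASK = 63      # flush when (sum & MASK) == MASK, i.e. roughly every 64 bytes

def flush_points(data):
    """Return the offsets at which a compressor would reset its dictionary."""
    points = []
    s = 0
    for i in range(len(data)):
        s += data[i]                # newest byte enters the window
        if i >= WINDOW:
            s -= data[i - WINDOW]   # oldest byte leaves the window
        if s & MASK == MASK:
            points.append(i + 1)
    return points

# Inserting bytes at the front shifts every later flush point by exactly
# the insertion length -- the blocks after the edit line up again.
if __name__ == "__main__":
    random.seed(1)
    original = bytes(random.randrange(256) for _ in range(8192))
    edited = b"XYZ" + original
    shifted = set(flush_points(edited))
    tail = [p for p in flush_points(original) if p > WINDOW]
    print(all(p + 3 in shifted for p in tail))   # True
```

With offset-based flushing (what plain gzip effectively does), a single inserted byte would instead desynchronise every block after the edit.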
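PS: The "agreed dictionary" approach mentioned above can be tried with Python's zlib, which exposes it as the zdict parameter: both ends share the dictionary out of band, so nothing extra travels in the stream. The sample dictionary text below is made up for illustration.

```python
import zlib

# Both sides must agree on this dictionary in advance; it is never
# transmitted with the compressed data (illustrative content only).
SHARED_DICT = b"Package: Version: Architecture: Depends: Description: "

def compress_with_dict(data):
    c = zlib.compressobj(zdict=SHARED_DICT)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob):
    d = zlib.decompressobj(zdict=SHARED_DICT)
    return d.decompress(blob) + d.flush()

msg = (b"Package: gzip Version: 1.3-2 Architecture: i386 "
       b"Depends: libc6 Description: GNU compression utility")
blob = compress_with_dict(msg)
assert decompress_with_dict(blob) == msg       # round trip works
assert len(blob) < len(zlib.compress(msg))     # the shared dictionary saved space
```

The obvious limitation is the one noted above: it only wins when the data actually resembles the agreed dictionary, which is why such schemes stay specialised.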