*New lossless compression tool*

I have written a lossless compression tool based on Rabin chunks that, when used in conjunction with gzip, produces files between 8% and 33% smaller than gzip alone. The median saving across my sample files is about 11%.
*Integration into gzip?*

I would love to integrate this into gzip after an appropriate period of testing. This is an exploratory email to see whether that is possible, and to introduce the tool to interested parties if it is not. I'm not sure what the mandate is for gzip: what are the restrictions on the algorithm(s) used, on time and memory performance, and on temporary files? If gzip is required to use only the LZ77-based DEFLATE algorithm it currently uses, or to remain backwards compatible so that any file compressed with gzip must be expandable by previous versions of gzip, then obviously rabin cannot be integrated into gzip. In that case you can treat the rest of this email as the announcement of a secondary compression tool that works well with gzip. Otherwise, I'd like to discuss the possibility of integrating the tool.

I don't have rigorous memory or time performance statistics; all I can do now is give my casual observations on some large files. The response to this email will determine how much effort I spend in the immediate future gathering statistics.

*Accessing the tool*

I have started a github project for rabin-tool at http://github.com/wurp/rabin-tool/

You can download, build, and play with it from there; see the README. Running rabin with no arguments lists all the possible arguments.

*Some performance and efficacy tests*

As an example, I took a tar of the first 100 MB of /usr/bin on my Debian box. Gzipped, it's about 31 MB. If I run it through rabin first, it gzips down to about 27 MB, slightly better than 10% smaller than gzip alone. I also built a 4 MB tar file of the Java source for another project; the same treatment (rabin, then gzip) produced a file 33% smaller than gzip alone. An 8 MB tar file of a WordPress HTML tree came out only 8% smaller than gzip alone. I have found no case so far in which applying rabin resulted in a larger file than the original, although I'm sure that would happen if rabin were applied multiple times.

The run time for the tool is about twice that of gzip. I have done nothing to overlap its processing with the disk read or to preallocate memory, so I'm sure that time can be reduced. Obviously, time spent reading the data from the drive would not need to be repeated if the tool were integrated into gzip. My rough guess is that integrating a performance-optimized version of the tool into gzip would add about 20% to the runtime.

The memory usage of the tool is currently not bounded: 8 bytes are used for every rabin chunk encountered, so memory usage grows linearly with the file size. Limiting the memory requirement would not be difficult, and my feeling is that a reasonable limit, intelligently implemented, would not significantly impact the efficacy of the tool. Memory usage currently should be about 10% of file size for reasonable compression parameters.

*Tasks for which the tool is useful*

The main things that it's good for are:

* compression
* identifying the parts of a file (text or binary) that have changed
* minimizing the data needed to store multiple versions of a file
* lots of other things

For example:

* you can identify which binaries are infected with a virus by knowing the hashes of some of the chunks in the middle of the virus and checking whether a binary also contains those chunks
* you can automatically detect viruses (even ones you have never seen before) by scanning your outgoing traffic for the same chunks going out to many different machines

rabin can do simple inline compression of a file: longish sequences of data that are repeated in the file are replaced by references to the first occurrence of that data, no matter how far apart the repetition and the first occurrence are. It can decompress what it compresses.
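To give a concrete picture of the mechanism, here is a minimal, self-contained sketch of content-defined chunking plus chunk-level de-duplication. It is not the tool's actual code: the window size, boundary mask, chunk hash (FNV-1a), dedup table layout, and the rabin_sketch command-line interface are all illustrative assumptions, and a simple multiplicative rolling hash stands in for a true Rabin (GF(2) polynomial) fingerprint. What it demonstrates is the property described above: chunk boundaries are chosen from the content itself, so a repeated run of data yields the same chunks and the same chunk ids no matter where it appears, which is what lets repeats be replaced by references to their first occurrence.

```c
/* rabin_sketch.c -- illustrative sketch only, not the rabin tool's code.
 * A simple multiplicative rolling hash stands in for a real Rabin
 * fingerprint; window size, mask, chunk hash, and table layout are
 * assumptions made for this example.
 *
 * Usage: rabin_sketch FILE
 * Prints one line per chunk: offset, length, chunk id, new/duplicate. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define WIN       48            /* rolling-hash window size (assumed)          */
#define MASK      0x1FFFu       /* 13 bits -> ~8 KiB average chunks (assumed)  */
#define MIN_CHUNK 2048
#define MAX_CHUNK 65536
#define TABLE     (1u << 20)    /* open-addressed dedup table slots            */

struct seen { uint64_t id; uint64_t off; int used; };
static struct seen table[TABLE];
static uint32_t used_slots;

/* FNV-1a over the chunk bytes: stands in for whatever chunk id the
 * real tool derives from its Rabin hash. */
static uint64_t chunk_id(const unsigned char *p, size_t n)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* Return the offset where this chunk id was first seen, or UINT64_MAX
 * (and remember it) if it is new.  This is the per-chunk bookkeeping
 * discussed above, though this sketch spends more than 8 bytes per chunk. */
static uint64_t first_offset(uint64_t id, uint64_t off)
{
    uint32_t i = (uint32_t)id & (TABLE - 1);
    for (;;) {
        if (!table[i].used) {
            if (used_slots + 1 == TABLE) {   /* keep one slot free so probing terminates */
                fprintf(stderr, "dedup table full; a real tool would grow or bound it\n");
                exit(1);
            }
            used_slots++;
            table[i].used = 1; table[i].id = id; table[i].off = off;
            return UINT64_MAX;
        }
        if (table[i].id == id)
            return table[i].off;
        i = (i + 1) & (TABLE - 1);
    }
}

static void emit(uint64_t start, size_t len, const unsigned char *chunk)
{
    uint64_t id   = chunk_id(chunk, len);
    uint64_t prev = first_offset(id, start);
    printf("%10llu %6zu %016llx %s\n",
           (unsigned long long)start, len, (unsigned long long)id,
           prev == UINT64_MAX ? "new" : "duplicate");
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror(argv[1]); return 1; }

    static unsigned char chunk[MAX_CHUNK];
    unsigned char win[WIN] = {0};
    const uint64_t B = 257;                     /* rolling-hash base (assumed) */
    uint64_t hash = 0, pow_w = 1, start = 0;
    size_t len = 0, wpos = 0;
    for (int i = 0; i < WIN; i++) pow_w *= B;   /* B^WIN mod 2^64 */

    int c;
    while ((c = fgetc(f)) != EOF) {
        unsigned char b = (unsigned char)c;
        /* Slide the window: add the new byte, drop the one WIN bytes back. */
        hash = hash * B + b - pow_w * win[wpos];
        win[wpos] = b;
        wpos = (wpos + 1) % WIN;
        chunk[len++] = b;

        /* Cut a chunk when the low bits of the hash hit a fixed pattern,
         * so boundaries depend on content, not on absolute file position. */
        if ((len >= MIN_CHUNK && (hash & MASK) == MASK) || len == MAX_CHUNK) {
            emit(start, len, chunk);
            start += len;
            len = 0;
        }
    }
    if (len)                                    /* final partial chunk */
        emit(start, len, chunk);
    fclose(f);
    return 0;
}
```

Compile with something like cc -O2 -o rabin_sketch rabin_sketch.c and run it over a tar file: the "duplicate" lines are the chunks an inline compressor would replace with back references, and the chunk ids are also what you would name per-chunk files after when splitting a file into a directory of chunks, as described next.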
The tool provides the following support to enable tasks other than inline compression:

* It can split a file out into chunks and throw all the chunks into a directory as separate files. Chunks are identified by their Rabin hash (currently; I may switch to another hashing algorithm). If a chunk occurs multiple times in the file, it only appears once in the directory.
* It can list all of the chunks, chunk lengths, and chunk ids in the file. It can do this in combination with creating the chunk files.
* I have a script that can reassemble chunks into the original file from the chunks plus the chunk list; I will eventually put this functionality directly into the tool.

Again, I would be happy to do the work to get this into gzip if that's something that could reasonably be done. Otherwise, I welcome suggestions (sent directly to me, to avoid spamming the gzip list) of places I should announce the tool so that interested users hear about it without bothering too many people who don't care.

Thanks!
Bobby
