Re: plucker file compression
> The reason for the extra reduction is that there is some redundancy
> between files. For instance, they probably have similar headers and
> footers.

Another option, albeit slower (but still available, and GPL), is rzip. It is based on the same checksum routines that rsync uses (a weak 32-bit algorithm and a stronger 128-bit one), which can find redundant parts and avoid re-compressing them. For example, if you have 10 home directories and 5 users have the same Linux kernel source unpacked in their home directory, rzip can detect that and simply not compress the "redundant" data over again. Very slick stuff.

If you're plucking 5,000 pages, and the header, footer, and graphics around the page ornaments are all the same, rzip can omit those redundant parts and compress only the ones that differ.

d.

___
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
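The weak/strong checksum scheme described above can be sketched in Python. This is a toy illustration, not rzip's or rsync's actual code: real implementations use much larger blocks and roll the weak checksum incrementally in O(1) per byte, where this demo recomputes it at each offset. The function names are made up for the example.

```python
import hashlib

BLOCK = 16      # tiny block size for the demo; rzip/rsync use far larger blocks
M = 1 << 16

def weak_sum(data):
    """rsync-style weak 32-bit checksum: low half is a plain byte sum,
    high half is a position-weighted sum."""
    a = sum(data) % M
    b = sum((len(data) - i) * c for i, c in enumerate(data)) % M
    return (b << 16) | a

def find_duplicate_blocks(reference, target):
    """Slide a window over `target`, flagging spans whose weak checksum
    matches a block of `reference`, then confirm with a strong (MD5) hash
    so weak-checksum collisions cannot cause a false match."""
    table = {}
    for off in range(0, len(reference) - BLOCK + 1, BLOCK):
        block = reference[off:off + BLOCK]
        table.setdefault(weak_sum(block), []).append(
            (off, hashlib.md5(block).hexdigest()))
    matches = []
    i = 0
    while i <= len(target) - BLOCK:
        window = target[i:i + BLOCK]
        hit = next((off for off, strong in table.get(weak_sum(window), ())
                    if hashlib.md5(window).hexdigest() == strong), None)
        if hit is not None:
            matches.append((i, hit))   # (offset in target, offset in reference)
            i += BLOCK                 # skip past the matched block
        else:
            i += 1                     # no match: advance one byte
    return matches
```

A compressor built on this would emit back-references for the matched spans and compress only the literal bytes in between, which is how the "same header and footer on 5,000 pages" case collapses.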
plucker file compression
Nathan Bullock:
> I have a plucker document with about 1600 html pages,
> they range from 1k - 37k. The vast majority are
> between 2k - 10k.
>
> Various compression techniques:
> 1. Total Raw Bytes 8Mb.
> 2. Total gzipped 3.1Mb (Each file individually compressed).
> 3. Total tar gzipped 2.3Mb

The reason for the extra reduction is that there is some redundancy between files. For instance, they probably have similar headers and footers.

You could get a similar reduction by using a custom dictionary. Then you would only need to parse this dictionary (once) plus the desired record, instead of everything up to the record. The zlib spec does allow for a custom dictionary, but (last I checked) this didn't seem to be implemented in the standard open source zlib.[1] It is "application-specific". We would also have to decide whether to use one (or several?) Plucker-wide custom dictionaries, a per-pdb dictionary stored under a special magic record number, or both.

[1] http://www.gzip.org/zlib/ suggests that it is there by 1.1.3 (which we use), but was improved since then.

-jJ
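For what it's worth, the preset-dictionary feature did land in zlib (deflateSetDictionary/inflateSetDictionary), and Python's zlib module has exposed it since Python 3.3 as the zdict argument to compressobj/decompressobj. A minimal sketch of the idea; the dictionary contents here are invented boilerplate, not an actual Plucker dictionary:

```python
import zlib

# Hypothetical shared dictionary built from boilerplate common to many records.
# Both sides must use the byte-identical dictionary, or decompression fails.
DICTIONARY = b"<html><head><title></title></head><body></body></html>"

def compress_record(data, zdict=DICTIONARY):
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_record(blob, zdict=DICTIONARY):
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

page = b"<html><head><title>Page 1</title></head><body>Hello</body></html>"
with_dict = compress_record(page)
without = zlib.compress(page, 9)
# For short records that share boilerplate with the dictionary, the
# dictionary-primed stream is smaller, yet each record is still
# independently decompressible -- no need to unpack everything-up-to-it.
```

This matches the trade-off described above: one extra artifact to ship (the dictionary itself), in exchange for per-record compression that approaches whole-archive compression on repetitive content.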
plucker file compression
Hello,

I have been looking into how the compression for plucker could be improved, and here are some numbers. I have a plucker document with about 1600 html pages; they range from 1k - 37k. The vast majority are between 2k - 10k.

Various compression techniques:
1. Total Raw Bytes 8Mb.
2. Total gzipped 3.1Mb (each file individually compressed).
3. Total tar gzipped 2.3Mb.

If I understand correctly, plucker gzips (zlib) each file individually and then puts them all into one big pdb file. (Note that when I pluck these html files, the pdb file is 3.1Mb, the same size as option 2.) This is good because it means that you don't have to decompress the entire 8 megs of data in order to retrieve the file you want. Bad because gzip doesn't compress 1k files nearly as well as 8-meg files.

Now I ran a little experiment where I took chunks of those small files and tarred them into bigger files (each of which was still smaller than 32k) and then gzipped these slightly larger files. This resulted in a total compressed size of 2.6Mb, a fair reduction from 3.1Mb. I used a very simple algorithm to determine which files to add together; possibly a better bin-packing algorithm would see even better improvement.

Anyway, could this be used with plucker? Could the file format be modified to put a number of small files into one 32k record, and then have plucker handle the unzipping and extraction of the proper file? If someone could tell me what they would like the file format to be like, I think I could handle the python side of things. But I am not sure about the palm side of things.

Nathan Bullock
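The experiment above can be sketched in a few lines of Python. This is an illustration, not the Plucker format: the function names, the greedy first-fit packing, and the length-prefix framing (so the reader can split a bundle back into records) are all my own assumptions.

```python
import zlib

CHUNK_LIMIT = 32 * 1024  # the 32k record ceiling mentioned in the thread

def pack_records(records, limit=CHUNK_LIMIT):
    """Greedy first-fit packing: append each record to the current bundle
    until adding one would push the *uncompressed* bundle past `limit`.
    A smarter bin-packing heuristic could pack tighter."""
    bundles, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > limit:
            bundles.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        bundles.append(current)
    return bundles

def bundle_blob(bundle):
    """Frame each record with a 4-byte big-endian length prefix so the
    Palm side can locate and extract one record from a decompressed bundle."""
    return b"".join(len(r).to_bytes(4, "big") + r for r in bundle)

def compressed_sizes(records):
    """Compare compressing each record alone vs. compressing bundles."""
    per_record = sum(len(zlib.compress(r, 9)) for r in records)
    per_bundle = sum(len(zlib.compress(bundle_blob(b), 9))
                     for b in pack_records(records))
    return per_record, per_bundle
```

The reader's cost is decompressing one bundle (at most 32k raw) to reach a record, instead of one record, which is still far cheaper than decompressing the whole 8Mb document.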