Re: plucker file compression

2004-03-03 Thread David A. Desrosiers

> The reason for the extra reduction is that there is some redundancy
> between files.  For instance, they probably have similar headers and
> footers.

Another option, albeit slower (but still available, and GPL), is
rzip. It is based on the same checksum routines that rsync uses (a
weak 32-bit rolling checksum and a stronger 128-bit hash), which let
it find redundant parts and avoid compressing them again.

For example, if you have 10 home directories, and 5 users have the
same Linux kernel source unpacked in their home directories, rzip can
detect that and simply won't compress the "redundant" data all over
again. Very slick stuff.

If you're plucking 5,000 pages, and the headers, footers, and
page-ornament graphics are all the same, rzip can skip those
redundant parts and only compress the parts that differ.

d.



plucker file compression

2004-03-03 Thread Jewett, Jim J
Nathan Bullock:

> I have a plucker document with about 1600 html pages,
> they range from 1k - 37k. The vast majority are
> between 2k - 10k.

> Various compression techniques:
> 1. Total Raw Bytes 8Mb.
> 2. Total gzipped 3.1Mb (Each file individually
> compressed).
> 3. Total tar gzipped 2.3Mb

The reason for the extra reduction is that there is
some redundancy between files.  For instance, they
probably have similar headers and footers.

You could get a similar reduction by using a custom
dictionary.  Then you would only need to parse this
dictionary (once) plus the desired record, instead
of everything-up-to-the-record.

The zlib spec does allow for a custom dictionary, but
(last I checked) this didn't seem to be implemented 
in the standard open source zlib.[1]  It is "application-
specific".  We would also have to decide whether to
use a (or several?) plucker-wide custom dictionary, a
per-pdb dictionary stored in a record with a special
magic number, or both.

[1] http://www.gzip.org/zlib/ suggests that it is present
as of 1.1.3 (which we use), but has been improved since then.
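
For concreteness, here is a minimal sketch of the
preset-dictionary idea using Python's zlib bindings
(newer versions expose the hook as the zdict argument;
the dictionary bytes below are only a stand-in for
whatever common header/footer material we would
actually ship).  On the viewer side the matching hook
in the C API is inflateSetDictionary().

import zlib

# Stand-in for a plucker-wide (or per-pdb) dictionary: common header and
# footer byte sequences, most frequent material last, since zlib favours
# the end of the dictionary.
PRESET_DICT = b"<html><head><title></title></head><body></body></html>"

def compress_record(data):
    c = zlib.compressobj(9, zdict=PRESET_DICT)
    return c.compress(data) + c.flush()

def decompress_record(blob):
    # The decompressor must be handed the exact same dictionary.
    d = zlib.decompressobj(zdict=PRESET_DICT)
    return d.decompress(blob) + d.flush()

page = b"<html><head><title>x</title></head><body>hi</body></html>"
assert decompress_record(compress_record(page)) == page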

-jJ


plucker file compression

2004-03-01 Thread Nathan Bullock
Hello,

I have been looking into how the compression for
plucker could be improved, and here are some
numbers...

I have a plucker document with about 1600 html pages,
they range from 1k - 37k. The vast majority are
between 2k - 10k.

Various compression techniques:
1. Total Raw Bytes 8Mb.
2. Total gzipped 3.1Mb (Each file individually
compressed).
3. Total tar gzipped 2.3Mb

If I understand correctly, plucker gzips (zlib) each
file individually and then puts them all into one big
pdb file. (Note that when I pluck these html files,
the pdb file is 3.1Mb, the same size as option 2.)
This is good because it means you don't have to
decompress the entire 8 megs of data in order to
retrieve the file you want, but bad because gzip
doesn't compress 1k files nearly as well as 8 meg
files.

Now I ran a little experiment where I took groups of
those small files, tarred them into bigger files
(each of which was still smaller than 32k), and then
gzipped these slightly larger files. This resulted in
a total compressed size of 2.6Mb, a fairly good
reduction from 3.1Mb. I used a very simple algorithm
to determine which files to group together; a better
bin-packing algorithm might see even more
improvement.
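
In case it helps, here is a simplified sketch of the
grouping idea (deliberately dumb greedy packing; it
skips tar and just concatenates the files before
compressing):

import zlib

LIMIT = 32 * 1024  # keep each bundle's raw size under 32k

def make_bundles(files):
    """Greedily pack (name, data) pairs into bundles smaller than LIMIT."""
    bundles, current, size = [], [], 0
    for name, data in files:
        if current and size + len(data) >= LIMIT:
            bundles.append(current)
            current, size = [], 0
        current.append((name, data))
        size += len(data)
    if current:
        bundles.append(current)
    return bundles

def compare_sizes(files):
    """Return (per-file compressed total, bundled compressed total)."""
    individual = sum(len(zlib.compress(data, 9)) for _, data in files)
    bundled = sum(
        len(zlib.compress(b"".join(data for _, data in bundle), 9))
        for bundle in make_bundles(files)
    )
    return individual, bundled

Files bigger than the limit simply end up in a bundle
of their own.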

Anyway, could this be used with plucker? Could the
file format be modified to put a number of small
files into one 32k file, and then have plucker handle
the unzipping and extracting of the proper file? If
someone could tell me what they would like the file
format to be, I think I could handle the python side
of things, but I am not sure about the palm side of
things.
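
As a straw man for discussion only (this is not the
current format, and every field name and width below
is made up), each bundle record could start with a
tiny index so the viewer can pull out one sub-document
after a single zlib pass:

import struct
import zlib

# Hypothetical bundle record layout (for discussion, not the real format):
#   uint16 count                                       number of sub-documents
#   count x (uint16 id, uint16 offset, uint16 length)  index entries
#   zlib-compressed concatenation of the sub-documents
def pack_bundle(docs):
    """docs: list of (doc_id, data) pairs; returns one bundle record."""
    index, body, offset = [], b"", 0
    for doc_id, data in docs:
        index.append(struct.pack(">HHH", doc_id, offset, len(data)))
        body += data
        offset += len(data)
    header = struct.pack(">H", len(docs)) + b"".join(index)
    return header + zlib.compress(body, 9)

def unpack_doc(record, wanted_id):
    """Decompress the bundle once and slice out the requested sub-document."""
    (count,) = struct.unpack_from(">H", record, 0)
    body = zlib.decompress(record[2 + 6 * count:])
    for i in range(count):
        doc_id, off, length = struct.unpack_from(">HHH", record, 2 + 6 * i)
        if doc_id == wanted_id:
            return body[off:off + length]
    raise KeyError(wanted_id)

The palm side would only need the equivalent of
unpack_doc() in C on top of its existing zlib
decompression.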

Nathan Bullock



