Sleepycat Software writes:
>
> If you specified the default prefix function as your prefix function, and
> got different results than specifying no prefix function, I'm interested
> in tracking that down, that sounds like a problem.
Sorry for not being clear. Here is what I do:
1) Leave the defaults (i.e. the default prefix function is used).
   I get 548 internal pages.
2) Set the compare function explicitly (pointing it at the default
   compare function) and leave the prefix function set to 0 (i.e. no
   prefix function is used at all).
   I get 546 internal pages.
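For what it's worth, the two setups look roughly like this against the
2.x db_open()/DB_INFO interface (just a sketch: lexcmp() below is my own
byte-wise compare standing in for the library's default, the file names
are made up, and error handling is stripped):

#include <string.h>
#include <db.h>         /* Berkeley DB 2.x */

/* Byte-wise comparison, meant to behave like the built-in default. */
static int
lexcmp(const DBT *a, const DBT *b)
{
    size_t len = a->size < b->size ? a->size : b->size;
    int cmp = memcmp(a->data, b->data, len);

    if (cmp != 0)
        return (cmp);
    return ((int)a->size - (int)b->size);
}

int
open_both(DB **dbp1, DB **dbp2)
{
    DB_INFO info;

    /* 1) All defaults: the default prefix function is used. */
    memset(&info, 0, sizeof(info));
    if (db_open("test1.db", DB_BTREE, DB_CREATE, 0644,
        NULL, &info, dbp1) != 0)
        return (-1);

    /* 2) bt_compare set explicitly, bt_prefix left at 0:
     *    no prefix function is used. */
    memset(&info, 0, sizeof(info));
    info.bt_compare = lexcmp;
    info.bt_prefix = 0;
    return (db_open("test2.db", DB_BTREE, DB_CREATE, 0644,
        NULL, &info, dbp2));
}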
> www.sleepycat.com
>
> The latest download is 2.7.5.
Thanks.
> That's hard. Writing a buffer pool to have variable sized pages is
> a nasty little problem. I remember one paper that's probably worth
> finding: I think it was SOSP 1993, and the work was done at DECSRC.
> The title was something about compression in a log-structured filesystem.
> (If you can't find it, let me know, I can probably dig up my copy.)
Never mind. I was already half sure that handling transparent compression
was beyond the capabilities of the current db. It would certainly be
worth reading the code of file systems that compress data on the fly
(there is one somewhere).
> Anyway, I haven't done an implementation, so I can't say for sure, but
> I seriously doubt this approach will work well.
>
> My inclination is to do the Huffman encoding -- that's going to be
> relatively straight-forward and is useful regardless of doing the
> other work. Plus, it should be straight-forward to calculate how
> much compression you'll get by selecting an encoding, and then use
> English character frequencies to determine the savings.
>
Let me see if I understand correctly. You think that compressing the data
in a page using Huffman coding with a static, pre-determined frequency
table based on English character frequencies will be easier. The reason
you think it will be easier is that the size of a compressed entry will
be the same regardless of its context.
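With a fixed table the compressed size of an entry can be computed from
the entry alone, along these lines (codelen[] stands for whatever static
code-length table we end up choosing; nothing here is an existing db
structure):

#include <stddef.h>

/*
 * With a static code table, the compressed size of an entry depends only
 * on its own bytes, never on neighbouring entries or on the page it sits
 * in.  codelen[] holds the code length in bits for each byte value.
 */
static size_t
encoded_bits(const unsigned char *entry, size_t len,
    const unsigned char codelen[256])
{
    size_t i, bits = 0;

    for (i = 0; i < len; i++)
        bits += codelen[entry[i]];
    return (bits);
}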
I agree with you on this point. Once we have this functionality, it should
be quite straightforward to also implement a function that builds a
frequency table from the actual content of the data and rebuilds the
db file with this table. It would allow someone to do the following:
. Build a db file using the default frequency table
. When the db file is built, calculate the actual frequency table
  (a small sketch of this step follows the list)
. Rebuild the db file from scratch using the actual frequency table
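The frequency calculation itself is trivial; something along these lines,
called once per key and once per data item by a db_dump-style pass over
the existing file (how the records are actually walked is left out here):

#include <stddef.h>

/* Byte-frequency histogram accumulated over the records of an
 * existing db file. */
struct freq_table {
    unsigned long count[256];
    unsigned long total;
};

static void
tally_record(struct freq_table *ft, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len; i++)
        ft->count[p[i]]++;
    ft->total += len;
}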
This scheme is likely to work very well. Assuming that I push URL contents
into the db file, if I build a db file containing 100 000 URLs, it is very
likely that the next 10 000 URLs I push into the db file will have the same
frequency distribution as the first 100 000 URLs.
In other words, we are likely to get a near-optimal compression rate
if we can recalculate the frequency table from time to time. Of course this
operation will be expensive, but the savings will be big. It could be a
functionality of db_dump and db_load.
Using this technique we can probably expect a compression rate of about 50%.
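A quick back-of-the-envelope supports that order of magnitude. The table
below holds approximate English character frequencies (letters plus space,
illustrative values only, not measured on real data); running Huffman's
algorithm over them gives the expected code length in bits per character:

#include <stdio.h>

/* Approximate English character frequencies (space + a..z),
 * for illustration only. */
static double freq[] = {
    /* space */ 0.182,
    /* a-g   */ 0.065, 0.012, 0.022, 0.034, 0.102, 0.018, 0.016,
    /* h-n   */ 0.049, 0.056, 0.001, 0.006, 0.032, 0.019, 0.054,
    /* o-u   */ 0.060, 0.015, 0.001, 0.048, 0.051, 0.073, 0.022,
    /* v-z   */ 0.008, 0.019, 0.001, 0.016, 0.001,
};
#define NSYM (sizeof(freq) / sizeof(freq[0]))

int
main(void)
{
    double w[NSYM], cost = 0.0;
    int n = (int)NSYM, i, lo1, lo2;

    for (i = 0; i < n; i++)
        w[i] = freq[i];

    /* Huffman's algorithm: repeatedly merge the two lightest nodes.
     * The sum of all merge weights is the expected code length
     * (bits per character) when the weights are probabilities. */
    while (n > 1) {
        lo1 = 0; lo2 = 1;
        if (w[lo2] < w[lo1]) { lo1 = 1; lo2 = 0; }
        for (i = 2; i < n; i++) {
            if (w[i] < w[lo1]) { lo2 = lo1; lo1 = i; }
            else if (w[i] < w[lo2]) { lo2 = i; }
        }
        cost += w[lo1] + w[lo2];
        w[lo1] += w[lo2];
        w[lo2] = w[--n];
    }

    printf("expected bits/char: %.2f (vs. 8), saving about %.0f%%\n",
        cost, 100.0 * (1.0 - cost / 8.0));
    return (0);
}

On these figures the expected code length comes out at a little over 4 bits
per character, i.e. a saving in the neighbourhood of 50% against 8-bit
characters, which is roughly consistent with the figure above.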
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/