[EMAIL PROTECTED] writes:
 > 
 > 
 > Hi everyone,
 > 
 > I'm currently  working on compression  code in mifluz that  will maybe
 > divide the index size by 8 :-)
 > 

 I'd like to say that the index size WILL be divided by 8 with Marcel's
compression function. It's not as if it were a very complex and delicate
piece of code. On the contrary, the code is very simple and
straightforward. We worked yesterday to finish the integration of the code
in Berkeley DB. The idea is to be able to specify a compression function
to Berkeley DB. The default compression function is gzip (from zlib), which
compresses the Berkeley DB file to 1/2 of its size. The compression function
written by Marcel is specific to htdig and compresses the Berkeley DB file
to 1/8.
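
 To illustrate the idea of handing a compression function to the storage
layer, here is a minimal Python sketch. It is not the Berkeley DB interface
and not Marcel's code: PageCodec, write_page, read_page, PAGE_SIZE and the
sample page contents are all hypothetical names and values chosen for the
illustration. Only zlib is real, standing in for the gzip default.

```python
import zlib
from dataclasses import dataclass
from typing import Callable

PAGE_SIZE = 8192  # hypothetical page size, chosen for the example only


@dataclass
class PageCodec:
    """A pluggable pair of page-level functions (hypothetical interface)."""
    compress: Callable[[bytes], bytes]
    decompress: Callable[[bytes], bytes]


# Default codec: zlib, standing in for the gzip default described above.
GZIP_CODEC = PageCodec(zlib.compress, zlib.decompress)


def write_page(page: bytes, codec: PageCodec = GZIP_CODEC) -> bytes:
    """Return the bytes that would be stored on disk for one page."""
    return codec.compress(page)


def read_page(stored: bytes, codec: PageCodec = GZIP_CODEC) -> bytes:
    """Recover the original page from its stored form."""
    return codec.decompress(stored)


# A made-up, roughly half-filled page: some entries, then zero padding.
page = (b"word:docid:flags;" * 240).ljust(PAGE_SIZE, b"\0")
stored = write_page(page)
print(len(page), "->", len(stored), "bytes")
```

 A data-specific codec would be passed in place of GZIP_CODEC. The ratio
this toy page reaches says nothing about real word.db pages.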

 As of now the compression function is working. The Berkeley DB integration
requires about 1 day of work, and the packaging, cleaning and testing in
real conditions require about 5 days of work. Hopefully the changes will be
committed late next week.

 From the statistics done by Marcel, I'm very confident that the word.db
file will be between 50% and 100% of the size of the actual data. At present
it's 800% of the actual data. Repacking (db_dump + db_load) reduces it to
400%. Compressing (using DB_COMPRESS, which triggers gzip) reduces it to 200%.
These figures are confirmed by the comparison between the 3.1 and 3.2 word.db
file sizes posted on this list.
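
 The arithmetic behind these percentages can be laid out in a few lines of
Python. This is only a sketch: the constant names are mine, and it assumes
the 200% figure is cumulative (repacking first, then gzip), which is one
reading of the numbers above.

```python
from fractions import Fraction

# Size of word.db relative to the actual data, as quoted in this message.
UNPACKED = Fraction(8)           # 800% after random inserts
REPACK_FACTOR = Fraction(1, 2)   # db_dump + db_load halves the file
GZIP_FACTOR = Fraction(1, 2)     # DB_COMPRESS (gzip) compresses to 1/2
MARCEL_FACTOR = Fraction(1, 8)   # htdig-specific function compresses to 1/8


def percent(ratio: Fraction) -> str:
    """Format a size ratio as a whole percentage of the actual data."""
    return f"{int(ratio * 100)}%"


repacked = UNPACKED * REPACK_FACTOR    # assumption: applied to the 800% file
repacked_gzip = repacked * GZIP_FACTOR # assumption: gzip on the repacked file
marcel = UNPACKED * MARCEL_FACTOR      # Marcel's function, no repacking

print("repacked:        ", percent(repacked))
print("repacked + gzip: ", percent(repacked_gzip))
print("Marcel's function:", percent(marcel))
```

 Under these assumptions Marcel's function lands at 100% of the actual data,
the upper end of the 50% to 100% range given above.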

 A very nice side effect of Marcel's function is that we won't need a
Berkeley DB repacker. When inserting words at random in a Berkeley DB
file, you consistently lose 50% of the disk space (the pages are only
half filled). My bet (not checked yet) is that we won't gain anything by
repacking, because there will be more data in the pages and the compression
rate will be lower (1/4). Repacking reduces the file to 1/2, but compression
then only reaches 1/4. Without repacking, compression reduces the file to
1/8. The end result is the same, so there is no need to repack. Of course
disk space is not the only issue: not repacking implies more internal pages
in the Berkeley DB tree, and this is bad for performance.
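
 The bet can be written down as a small comparison. This is a sketch of the
reasoning only: final_size is a made-up helper, and the 1/4 rate on repacked
pages is the unchecked guess from the paragraph above, not a measurement.

```python
from fractions import Fraction


def final_size(initial: Fraction, repack: Fraction,
               compression: Fraction) -> Fraction:
    """Size relative to the actual data after repacking and compression."""
    return initial * repack * compression


INITIAL = Fraction(8)  # word.db is 800% of the actual data before any step

# Half-filled pages left as they are, compressed to 1/8.
no_repack = final_size(INITIAL, Fraction(1), Fraction(1, 8))

# Repacked to 1/2 first; fuller pages then only compress to 1/4 (a guess).
with_repack = final_size(INITIAL, Fraction(1, 2), Fraction(1, 4))

print("without repacking:", no_repack)
print("with repacking:   ", with_repack)
```

 If the guess holds, both paths end at the same size on disk, so the choice
comes down to the number of internal pages in the tree.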

   Cheers,
 
-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 10 85
                e-mail: [EMAIL PROTECTED]
                URL: http://www.senga.org/


------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.
