Hey all,

        I've got two proposals here for the WordDB:

1.  Add a new config verb that enables zlib compression of WordDB pages.
        This would be an option for users who run into this error:

FATAL ERROR:Compressor::get_vals invalid comptype
FATAL ERROR at file:WordBitCompress.cc line:827 !!!

 If you look into the db/mp_cmpr.c code (Loic's Compressed BDB page code)
 you'll find these two functions:
        CDB___memp_cmpr_inflate(..)
        CDB___memp_cmpr_deflate(...)

 They are drop-in zlib-based replacements for the
(*cmpr_info->uncompress) and (*cmpr_info->compress) function-pointer calls.

 Yes, the compression isn't as good as the ad-hoc bit-stream compression
in WordDBCompress, WordBitCompress, and WordDBPage.  The advantage is that
it's fairly bulletproof (zlib) and better than turning off WordDB
compression altogether with the 'wordlist_compress' config verb.
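Just to illustrate the idea — a quick Python sketch, not the actual
mp_cmpr.c code; the wrapper names here are made up, and only the zlib
round trip is the point:

```python
import zlib

def cmpr_deflate(page: bytes, level: int = 6) -> bytes:
    """Compress a raw WordDB page with zlib (illustrative stand-in
    for the CDB___memp_cmpr_deflate role)."""
    return zlib.compress(page, level)

def cmpr_inflate(blob: bytes) -> bytes:
    """Decompress a zlib-compressed page (illustrative stand-in
    for the CDB___memp_cmpr_inflate role)."""
    return zlib.decompress(blob)

# A fake, repetitive BDB-style page: index pages compress well.
page = b"affect\x00323\x0043\x00" * 100
packed = cmpr_deflate(page)
assert cmpr_inflate(packed) == page   # lossless round trip
print(len(page), "->", len(packed))
```

The compression ratio won't beat a purpose-built bit-stream coder, but
the round trip is guaranteed lossless, which is exactly the
"bulletproof" property we want as a fallback.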

   Merging Loic's latest mifluz is supposed to fix this problem (Geoff
and I have been working on this), but so far the merge is fairly complex
and needs much more work and long-term testing.  In the meantime, this is
a decent interim solution.


2.  The inverted index is not very efficient in general.

The current scheme:

WORD    DOCID   LOCATION
affect  323    43  
affect  323    53  
affect  336    14  
affect  336    148 
affect  336    155 
affect  351    43  
affect  358    370 
affect  399    51  
affect  400    10  
affect  400    86  
affect  400    95  
affect  400    139 
affect  400    215 
affect  400    222 
affect  400    229 

A more efficient inverted scheme:

affect  323    43, 53
affect  336    14, 148, 155
affect  351    43  
affect  358    370 
affect  399    51  
affect  400    10, 86, 95, 139, 215, 222, 229
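The grouping above is just a fold over (word, docid) pairs; a rough
Python sketch (the function name and row tuples are only illustrative):

```python
from collections import OrderedDict

# Current scheme: one row per (word, docid, location).
rows = [
    ("affect", 323, 43), ("affect", 323, 53),
    ("affect", 336, 14), ("affect", 336, 148), ("affect", 336, 155),
    ("affect", 400, 10), ("affect", 400, 86), ("affect", 400, 95),
]

def group_postings(rows):
    """Merge per-location rows into one (word, docid) -> [locations] row,
    preserving the original sort order of the index."""
    merged = OrderedDict()
    for word, docid, loc in rows:
        merged.setdefault((word, docid), []).append(loc)
    return merged

for (word, docid), locs in group_postings(rows).items():
    print(word, docid, ", ".join(str(l) for l in locs))
```

Since the index is already sorted by word and docid, this merge can be
done in a single pass with no extra lookups.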

We would need to augment WordDB and the associated classes to support
parsing the multi-location values.

We could also avoid any dynamic resizing of the LOCATION value field in
BDB by making it fixed-width.

Ex: Let's say the LOCATION value field is full at 32 characters.  Further
locations of 'affect' in doc 400 get new rows:

affect  400    10, 86, 95, 139, 215, 222, 229
affect  400    300, 322, 395, 439, 516

The objects would keep track of the field lengths and create new rows as
needed.
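That overflow logic is simple; here's a rough Python sketch (the helper
name is hypothetical, and the 32-character width matches the example
above):

```python
def pack_locations(locations, width=32):
    """Pack location codes into fixed-width, comma-separated fields,
    starting a new row whenever the next code would overflow."""
    rows, current = [], ""
    for loc in locations:
        piece = str(loc) if not current else ", " + str(loc)
        if len(current) + len(piece) > width:
            rows.append(current)      # field is full: start a new row
            current = str(loc)
        else:
            current += piece
    if current:
        rows.append(current)
    return rows

locs = [10, 86, 95, 139, 215, 222, 229, 300, 322, 395, 439, 516]
for row in pack_locations(locs):
    print(f"affect  400    {row:<32}")
```

With these inputs it reproduces the two rows shown in the example above:
the first seven codes fill the 32-character field, and the rest spill
into a second row.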

If the fixed-width LOCATION field were around 256 characters, it would
hold roughly 40-50 one- to four-digit location codes, so the vast
majority of the time a second row would not be needed.  For large
documents this changes, but the scheme would still be much more
efficient.

Eh? Feedback?


EXTRA NOTE: Memory Leak detection:

I also wanted to make the developers aware (if you aren't already) of
Valgrind.  It's a nice open-source memory-error checking tool.

In general you use it like this:

valgrind htdig xxx xxx xxx

It seems to be pretty comparable to both Purify and Insure.  It's not
going to get you as good a result as compile- or link-time code
instrumentation, but it's better than nothing at all.  Interestingly,
Insure ships a program called 'Chaperon' that you use in the same way as
Valgrind on debug binaries.  I haven't looked into it in detail, but my
guess is that both build on the native memory-debugging facilities of
glibc.

Valgrind Home Page
http://developer.kde.org/~sewardj/

KDE GUI frontend to Valgrind
http://www.weidendorfers.de/kcachegrind/index.html

-- 
Neal Richter 
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

-------------------------------------------------------
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev