Hey all,
I've got two proposals here for the WordDB:
1. Add a new config verb that lets users switch to zlib WordDB-page compression.
This would be an option for users who run into this error:
FATAL ERROR:Compressor::get_vals invalid comptype
FATAL ERROR at file:WordBitCompress.cc line:827 !!!
If you look into the db/mp_cmpr.c code (Loic's Compressed BDB page code)
you'll find these two functions:
CDB___memp_cmpr_inflate(...)
CDB___memp_cmpr_deflate(...)
They are drop-in zlib-based replacements for the
(*cmpr_info->uncompress) and (*cmpr_info->compress) function-pointer calls.
Yes, the compression isn't as good as the ad-hoc bit-stream compression
in WordDBCompress, WordBitCompress, and WordDBPage. The advantage is that
it's fairly bulletproof (it's just zlib) and better than turning off WordDB
compression altogether with the 'wordlist_compress' config verb.
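For the curious, a zlib-based compress/uncompress pair is only a few lines.
Here's an illustrative sketch; the names and signatures below are mine, not
the actual CDB___memp_cmpr_* prototypes (those live in db/mp_cmpr.c):

    #include <zlib.h>

    /* Illustrative stand-ins for the (*cmpr_info->compress) and
     * (*cmpr_info->uncompress) function-pointer targets. */
    static int zlib_page_compress(const unsigned char* in, unsigned in_len,
                                  unsigned char* out, unsigned* out_len)
    {
        uLongf dest_len = *out_len;
        /* compress2() is the standard zlib call; level 6 is the default
         * speed/ratio tradeoff. */
        if (compress2(out, &dest_len, in, in_len, 6) != Z_OK)
            return -1;  /* page didn't fit, or zlib error */
        *out_len = (unsigned)dest_len;
        return 0;
    }

    static int zlib_page_uncompress(const unsigned char* in, unsigned in_len,
                                    unsigned char* out, unsigned* out_len)
    {
        uLongf dest_len = *out_len;
        if (uncompress(out, &dest_len, in, in_len) != Z_OK)
            return -1;
        *out_len = (unsigned)dest_len;
        return 0;
    }

The new config verb would then just point cmpr_info at these instead of the
WordDBCompress bit-stream routines.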
Merging Loic's latest mifluz is supposed to fix this problem (Geoff
and I have been working on this), but so far the merge is fairly complex
and needs much more work and long-term testing. In the meantime this is a
decent stopgap.
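In htdig.conf terms the option might look something like this (the new verb
name is just a straw man, not an existing attribute):

    # existing verb: turn WordDB page compression off entirely
    wordlist_compress: false
    # proposed verb: keep compression, but use zlib instead of the
    # bit-stream compressor
    wordlist_compress_zlib: true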
2. The inverted index is not very space-efficient in general.
The current scheme:
WORD DOCID LOCATION
affect 323 43
affect 323 53
affect 336 14
affect 336 148
affect 336 155
affect 351 43
affect 358 370
affect 399 51
affect 400 10
affect 400 86
affect 400 95
affect 400 139
affect 400 215
affect 400 222
affect 400 229
A more efficient inverted scheme:
affect 323 43, 53
affect 336 14, 148, 155
affect 351 43
affect 358 370
affect 399 51
affect 400 10, 86, 95, 139, 215, 222, 229
We would need to augment the WordDB and associated classes to support the
value parsing, sketched below.
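A minimal sketch of that parsing, assuming the locations are stored as a
comma-separated ASCII list (these helper names are hypothetical, not
existing WordDB methods):

    #include <cstdlib>
    #include <sstream>
    #include <string>
    #include <vector>

    // Pack a location list into the comma-separated value field.
    std::string pack_locations(const std::vector<unsigned>& locs)
    {
        std::ostringstream out;
        for (size_t i = 0; i < locs.size(); i++) {
            if (i) out << ", ";
            out << locs[i];
        }
        return out.str();
    }

    // Unpack the value field back into a location list.
    std::vector<unsigned> unpack_locations(const std::string& value)
    {
        std::vector<unsigned> locs;
        std::istringstream in(value);
        std::string field;
        while (std::getline(in, field, ','))
            locs.push_back((unsigned)std::strtoul(field.c_str(), 0, 10));
        return locs;
    }

A real implementation would probably pack the locations in binary
(delta-coded, even), but the row-merging win is the same either way.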
We would also be able to avoid any dynamic resizing of the LOCATION
value field in BDB by making it a fixed width.
For example, say this LOCATION value is 'full' at 32 characters. Further
locations of 'affect' in doc 400 then get a new row:
affect 400 10, 86, 95, 139, 215, 222, 229
affect 400 300, 322, 395, 439, 516
The objects would keep track of the field lengths and create new rows as
needed.
If the fixed-width LOCATION field were around 256 characters, it would hold
roughly 40-50 one- to four-digit location codes, so in the vast majority of
cases a second row would never be needed. For large documents this changes,
but the scheme would still be much more efficient.
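Something like this could handle the overflow into new rows (again, purely a
hypothetical sketch, not existing WordDB code):

    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    const size_t LOCATION_FIELD_WIDTH = 256;  // fixed value-field width

    // Split a location list into as many fixed-width rows as needed,
    // so BDB never has to resize a value field in place.
    std::vector<std::string> pack_fixed_rows(const std::vector<unsigned>& locs)
    {
        std::vector<std::string> rows;
        std::string row;
        char buf[16];
        for (size_t i = 0; i < locs.size(); i++) {
            snprintf(buf, sizeof(buf), "%u", locs[i]);
            size_t need = strlen(buf) + (row.empty() ? 0 : 2);  // ", "
            if (!row.empty() && row.size() + need > LOCATION_FIELD_WIDTH) {
                rows.push_back(row);  // this row is 'full'; start a new one
                row.clear();
            }
            if (!row.empty()) row += ", ";
            row += buf;
        }
        if (!row.empty()) rows.push_back(row);
        return rows;
    }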
Eh? Feedback?
EXTRA NOTE: Memory Leak detection:
I also wanted to make the developers aware (if you aren't already) of
Valgrind. It's a nice open-source memory-error checking tool.
In general you use it like this:
valgrind htdig xxx xxx xxx
It seems to be pretty comparable to both Purify and Insure. It's not
going to give you results as good as compile- or link-time code
instrumentation, but it's better than nothing at all. Interestingly, Insure
ships a program called 'Chaperon' that you use in the same way as
Valgrind on debug binaries. I haven't looked into it in detail, but my
guess is that both build on the native memory-debugging facilities of
glibc.
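For what it's worth, a leak-check run over htdig would look something like
this (--leak-check is a standard Valgrind option; -v/-c are the usual htdig
flags):

    valgrind --leak-check=yes htdig -v -c /path/to/htdig.conf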
Valgrind Home Page
http://developer.kde.org/~sewardj/
KDE GUI frontend to Valgrind
http://www.weidendorfers.de/kcachegrind/index.html
--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485