Greetings all,

One of the biggest show-stoppers for 3.2.0rc1 seems to be speed. We had agreed to complete testing and then release, but ht://Dig would get a very bad reputation if we released the current slow product. I propose that we suspend testing and all concentrate on getting performance to within a factor of 2 of the 3.1.6 speed.
Some possible approaches are:

1. Keep the database format entirely unchanged, but write to it more efficiently. Neal reported good results by caching updates using the STL. That would improve the "locality" of the database accesses, both improving the database's own caching and reducing the performance hit from compression. Neal, could you tell us a bit more about this?

2. Give users the option to store only the first occurrence of each word in each document. That will kill phrase searching, but it should make the database smaller and (if done correctly) eliminate most of the writes to it in the build phase. This keeps the database format nominally the same (so search/purge are unaffected).

3. Totally rewrite the database format to avoid the significant redundancy. This really should be done at some stage, and I vote the sooner the better. As I understand it, the entries are all of the format

   Word, Doc ID (32 bits), flags (8 bits), location (16 bits).

   Does anyone know how well BDB handles variable-length records? If it handles them well, how about a format like:

   Word, Doc ID (32 bits), count (16), <flags (8), offset (16)> +

   Here, the "offset" field is the *difference* between the locations of consecutive occurrences. These numbers will be more likely to be under 255, and so should compress better. Because entries are made for cross-references from other documents, we could allow multiple entries of that form for the same word/document, but still massively reduce the number of redundant Word/DocID fields, and (more importantly?) the number of database writes.

Could I ask for a "show of hands" of people who can help here? We need people who know why the current database format was selected, understand the current code, understand BDB, or are willing to help code. (I'm only in the last of those categories, unfortunately.)
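To make options 1 and 3 a bit more concrete, here is a rough sketch of what I have in mind. It is not based on the existing code: the names WordCache, WordRecord, Occurrence, Hit and decode_locations are all made up for illustration, and it stops short of any BDB calls. It collects occurrences in an STL map keyed by (word, doc ID), then on flush emits one record per key with the locations stored as differences from the previous location. I have assumed the first offset in a record is simply the absolute location.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; none of these names exist in ht://Dig.
struct Occurrence {
    std::uint8_t  flags;   // 8-bit flags, as in the current entries
    std::uint16_t offset;  // gap from the previous location (first entry: absolute)
};

struct WordRecord {
    std::string word;
    std::uint32_t doc_id;
    std::vector<Occurrence> occurrences;  // the 16-bit "count" is occurrences.size()
};

// One raw hit as seen while parsing a document.
struct Hit {
    std::uint16_t location;
    std::uint8_t  flags;
};

// Buffers hits in memory so that each (word, doc ID) pair needs one write.
class WordCache {
public:
    void add(const std::string& word, std::uint32_t doc_id,
             std::uint16_t location, std::uint8_t flags) {
        pending_[std::make_pair(word, doc_id)].push_back(Hit{location, flags});
    }

    // Turns the buffered hits into delta-encoded records, sorted by key,
    // and empties the cache. A real version would write each record to the
    // database here instead of returning them.
    std::vector<WordRecord> flush() {
        std::vector<WordRecord> out;
        for (auto& entry : pending_) {
            std::vector<Hit>& hits = entry.second;
            assert(hits.size() <= 0xFFFF);  // count must fit in 16 bits
            std::sort(hits.begin(), hits.end(),
                      [](const Hit& a, const Hit& b) { return a.location < b.location; });

            WordRecord rec{entry.first.first, entry.first.second, {}};
            std::uint16_t prev = 0;
            for (const Hit& h : hits) {
                // Locations are sorted, so the difference cannot go negative.
                rec.occurrences.push_back(
                    Occurrence{h.flags, static_cast<std::uint16_t>(h.location - prev)});
                prev = h.location;
            }
            out.push_back(rec);
        }
        pending_.clear();
        return out;
    }

private:
    std::map<std::pair<std::string, std::uint32_t>, std::vector<Hit>> pending_;
};

// Rebuilds absolute locations from the stored gaps (what search would do).
std::vector<std::uint16_t> decode_locations(const WordRecord& rec) {
    std::vector<std::uint16_t> locations;
    std::uint16_t current = 0;
    for (const Occurrence& occ : rec.occurrences) {
        current = static_cast<std::uint16_t>(current + occ.offset);
        locations.push_back(current);
    }
    return locations;
}
```

Because std::map keeps its keys sorted, flushing visits the words in order, which is the sort of locality option 1 is after. Whether that actually helps BDB, and how large the cache can grow before it needs flushing, would have to be measured.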
Cheers,
Lachlan

--
[EMAIL PROTECTED]
ht://Dig developer DownUnder (http://www.htdig.org)
