At 8:53 AM -0400 6/28/00, Terry Luedtke wrote:
>There are other ways to improve the speed while still using 
>BerkeleyDB (or any other db for that matter).  The ability to run 
>concurrent digs into the same database for one.  An htsearch that 
>stays in memory, similar to fast-cgi programs, for another.

Yup. This is one reason that the 3.2 code uses this new database 
layout. (It's hard to say "format" since it's still based on Berkeley 
DB, but it's storing the data in a different fashion.) The htdig 
crawler now generates databases in htsearch-ready format. Granted, if 
there are likely to be a large number of bad URLs or changed 
documents, it's a good idea to run the "htpurge" program to remove 
them.
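The kind of cleanup pass htpurge performs can be sketched like this. This is only an illustration: the dict stands in for the on-disk Berkeley DB file, and the "status" field is a hypothetical marker for documents the dig found to be bad or changed, not htdig's actual record schema.

```python
# Minimal sketch of a post-dig purge pass. The dict stands in for the
# on-disk database; "status" is a hypothetical flag set during the dig.
def purge(db):
    """Drop entries flagged as bad so htsearch never serves them."""
    dead = [url for url, rec in db.items() if rec.get("status") == "bad"]
    for url in dead:
        del db[url]
    return len(dead)

db = {
    "http://example.com/a": {"status": "ok"},
    "http://example.com/b": {"status": "bad"},
}
removed = purge(db)
```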

The problem with concurrent digs into the same database is that it 
requires careful locking of writes to make sure the threads or 
processes do not change the same data. It's probably more useful to 
allow htsearch to browse through "collections" of data.
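The locking requirement can be illustrated with a small sketch, with Python threads standing in for concurrent dig processes and a plain dict standing in for the shared database; none of this is htdig's real storage layer.

```python
import threading

db = {}                      # stand-in for the shared word database
db_lock = threading.Lock()   # serializes writes so diggers don't clobber each other

def record_word(word, url):
    """Each concurrent dig must take the lock before touching shared data."""
    with db_lock:
        db.setdefault(word, []).append(url)

threads = [
    threading.Thread(target=record_word,
                     args=("search", "http://example.com/%d" % i))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two writers could read the same list, append to their own copies, and lose one of the updates; the lock makes each read-modify-write step atomic.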

Yes, there are also plenty of ways of speeding up htsearch even 
without converting it into a persistent CGI/servlet. Caching, of 
course, would help significantly, and I'm committed to having that 
implemented.
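As one concrete (and purely hypothetical) shape for that caching, a query-result cache could be as simple as memoizing the query string and its result list, so repeated identical searches skip the database walk entirely. Nothing below is htsearch's actual API.

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def run_query(query):
    """Stand-in for htsearch's lookup; identical queries hit the cache."""
    # Pretend this is the expensive database walk.
    return tuple("doc-for-%s" % word for word in query.split())

first = run_query("free software")
second = run_query("free software")   # served from the cache
hits = run_query.cache_info().hits
```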

I don't want to get too deep into a database-format discussion on 
this list. Personally, I think it would be great to have some SQL 
support if people choose to try that out. But so far, no one has 
submitted patches to the current 3.2 CVS tree to my knowledge. *I'm* 
certainly the last one to do it--I still have to finish up writing 
the new htsearch query parser!

-Geoff


------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this. 
