At 9:35 AM +0100 12/6/00, [EMAIL PROTECTED] wrote:
>Is there anything wrong with our db files? htsearch seems to be able
>to use them, though. Am I missing something?

No, but I don't think you want to use the db_dump programs to deal 
with them. In particular, ht://Dig "serializes" the documents in the 
document DB and can compress the excerpts, so large parts will come 
out in binary.

>Why do I want to "edit" the db files at all? The reason is that we have
>a large database with quite a number of things we'd like to exclude
>from the search results. The obvious solution would be to exclude them
>from the dig in the first place. But I don't consider this possible
>because a) this would make the config quite bulky

You can always include a separate file of patterns in the config file, e.g.:
exclude_urls: `/path/to/patterns`
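
In case it helps, a minimal sketch of what that can look like (the paths
and patterns below are just placeholders; as I recall, the backquoted file
is read as a whitespace-separated list, so one pattern per line works):

  In htdig.conf:
    exclude_urls: `/usr/local/htdig/conf/exclude.patterns`

  In /usr/local/htdig/conf/exclude.patterns:
    /cgi-bin/
    /private/
    printer-friendly

That keeps the main config compact no matter how long the exclusion list gets.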

In the 3.2 code, you can do limited editing with the new htdump and 
htload programs. If you just want to delete URLs, though, the new 
htpurge program is much easier.
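
I don't have the option letters in front of me, so double-check htpurge's
usage output, but the idea is roughly (config path and URL are placeholders):

    htpurge -c /usr/local/htdig/conf/htdig.conf -u http://example.com/old/page.html

i.e. point it at your config and hand it the URLs to drop, and it removes
them from the databases without a full re-dig.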

>PS: How does htdig handle the case where a document is in the docs database
>but the corresponding URL is added to the exclude list? Will the document
>be deleted from the db on the next update run, or would I have to delete the
>db and run a "full index" again?

The exclude_urls pattern set is only used when considering whether to 
index a new URL. So if a URL is already in the database, it will not 
be removed. There is a similar, but more serious, problem if a URL is 
later disallowed by the robots.txt file. In both cases, the code is 
upholding the "letter of the law," but the behavior is a bit hazy.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
