My point was that if I specify a maximum document limit of 150, I
would like htdig to count 150 fully retrieved documents, not just
150 attempted crawls of links. In my case, 40 of my documents
redirect to outside servers which are not to be crawled, but htdig
STILL increments the counter for them.

While this behavior seems perfectly acceptable to most, I am hoping
someone could give me a tip/patch that would NOT increment the
total_index counter when htdig encounters a 300-series HTTP redirect,
or that would increment server_max_docs by one to compensate, without
interfering with any DocumentId down the line.

My goal is to crawl and record exactly the "server_max_docs"
documents -- real documents, not uncrawlable 301/302 redirects.

At 10:14 PM 10/26/2000 -0500, Geoff Hutchison wrote:
>Sorry I didn't respond to this earlier--it's been a very busy week, but 
>I'm trying to catch up this weekend.
>
>At 12:41 PM -0500 10/26/00, Gilles Detillieux wrote:
>>The old record for the pre-redirect URL will get tossed out of the 
>>database by htmerge/htpurge, and the total index size should be corrected 
>>at that point.
>
>Furthermore, htsearch will ignore the document if for some reason it 
>hasn't been purged yet, at least in the 3.2 code.
>
>With the way the database worked in 3.1 and before, you *had* to change 
>DocIDs when you had a redirect. After all, the database was keyed by the 
>URL and you can't just change the key of the record in a B-Tree. Now the 
>main DocDB is keyed by DocID, but you still have to keep an index of 
>URL->DocID for htdig and you still don't want to change the key there.
>
>--
>-Geoff Hutchison
>Williams Students Online
>http://wso.williams.edu/


------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this. 