Hello! I'm trying to index a news archive with htdig. I currently use a simple CGI script to convert the postings (plain text files) to HTML pages. The script sets proper Expires: and Last-Modified: HTTP headers.
Another script gathers all postings from the archive and creates an HTML file with links pointing to all postings through the converter CGI script. htdig uses this HTML file to start the dig, so it can easily find all postings.

When I run htdig on the list file, it seems to touch every URL in every run. I already use a slightly modified version of the contributed rundig.sh script (the one that works with the -a switch on database copies for indexing while maintaining the ability to search). I want to tell htdig to revisit the URLs only after the 'Expires:' date the server sent, instead of requesting them every run with an 'If-Modified-Since:' header.

The archive contains about 400,000 postings and every request starts up a new CGI process -- slowing down indexing and driving the server load far over 1. Setting server_wait_time to 1 fixes the load problem, but then it takes at least 400,000 seconds, or about 111 hours (nearly 5 days), to index the archive. This means every new posting added during the dig will only be indexed in the next run after the current dig, so new postings can only be found after 5 to 10 days. I don't think this is a good solution.

The postings in the archive seldom change. In fact, they never change, but some of them may be taken out of the archive for policy reasons from time to time. It would be fine if htdig only accessed URLs it has not touched yet (the postings new to the archive) and revisited already known ones only after their expiration time.

The url_log does not seem to be a solution to my problem. In the htdig archive, I read about the 'revisit-after' META flag. As far as I could find out, it's not implemented (yet?). I could use the url_list feature, rotating the lists for n days and removing every URL contained in those lists from the HTML file I mentioned above. This would be a work-around, but I want a more elegant solution. Any suggestions?
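For reference, the url_list work-around I have in mind would look roughly like the sketch below. It is untested against a real dig and rests on assumptions: that each rotated url_list file holds one URL per line, that "contained in those lists" means appearing in at least one of them, and that the function names, file names, and example URLs are all made up for illustration.

```python
import html


def read_url_list(path):
    """Return the set of URLs in one url_list file (assumed: one URL per line)."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def pending_urls(all_urls, url_list_paths):
    """Return the URLs that appear in none of the rotated url_list files.

    The order of all_urls is preserved so the start page stays stable.
    """
    seen = set()
    for path in url_list_paths:
        seen |= read_url_list(path)
    return [url for url in all_urls if url not in seen]


def render_start_page(urls):
    """Build a bare HTML page of links for htdig to start the dig from."""
    lines = ["<html><body>"]
    for url in urls:
        escaped = html.escape(url, quote=True)
        lines.append(f'<a href="{escaped}">{escaped}</a><br>')
    lines.append("</body></html>")
    return "\n".join(lines) + "\n"
```

The gathering script would call pending_urls() with the full list of converter URLs and the last n url_list files, then write the result of render_start_page() over the start file. Postings drop out of the rotated lists after n days and are then offered to htdig again, which only approximates a real expiry check.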
Oh, maybe I should mention some information about the system:

htdig version 3.1.6
Debian Sarge
Pentium 4 2.4 GHz processor
512 MiB RAM

Thanks in advance,
Christoph

