[htdig] Revisit pages only after Expire: date?

Christoph 'Mehdorn' Weber Thu, 20 Apr 2006 14:17:06 -0700

Hello!

  I'm trying to index a news archive with htdig. I currently use a
simple CGI script to convert the postings (plain text files) to HTML
pages. The script sets proper Expires: and Last-Modified: HTTP headers.
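
  For illustration only, here is a minimal sketch of how such a converter
could compute the two header values. The actual CGI script is not shown
in this posting, so the function name cache_headers and the 30-day
lifetime are assumptions, not the real implementation.

```python
from email.utils import formatdate

SECONDS_PER_DAY = 86400


def cache_headers(mtime, now, lifetime_days=30):
    """Return (Last-Modified, Expires) values as HTTP date strings.

    mtime: modification time of the posting file (seconds since epoch)
    now:   current time (seconds since epoch)
    lifetime_days: assumed freshness lifetime, purely illustrative
    """
    last_modified = formatdate(mtime, usegmt=True)
    expires = formatdate(now + lifetime_days * SECONDS_PER_DAY, usegmt=True)
    return last_modified, expires


def header_lines(mtime, now, lifetime_days=30):
    """Format the two values as CGI response header lines."""
    last_modified, expires = cache_headers(mtime, now, lifetime_days)
    return f"Last-Modified: {last_modified}\r\nExpires: {expires}\r\n"
```

A CGI script would write header_lines(os.path.getmtime(path), time.time())
before the Content-Type line and the blank line that ends the headers.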


  Another script gathers all postings from the archive and creates an
HTML file with links pointing to all postings through the converter CGI
script. htdig uses this HTML file as the starting point of the dig, so it
can easily find all postings.

  When I run htdig on the list file, it seems to touch every URL in
every run. I already use a slightly modified version of the contributed
rundig.sh script (the one that works with the -a switch on database
copies for indexing while maintaining the ability to search).

  I want to tell htdig to revisit the URLs only after the sent 'Expires:'
date instead of requesting them every run with an 'If-Modified-Since:'
header. The archive contains about 400,000 postings and every request
starts up a new CGI process, slowing down indexing and driving the
server load far above 1.
  Setting the server_wait_time to 1 fixes the load problem, but then it
will take at least 400,000 seconds or about 111 hours (nearly 5 days) to
index the archive. This means every posting added during a dig will only
be indexed in the run after the current one. So new postings can
only be found after 5 to 10 days. I don't think this is a good solution.
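
  The estimate can be checked with a few lines of arithmetic. This is
only a lower bound: it counts the one-second wait per request and assumes
the requests themselves take no time.

```python
postings = 400_000       # approximate number of postings in the archive
server_wait_time = 1     # seconds htdig waits between requests

total_seconds = postings * server_wait_time
hours = total_seconds / 3600
days = hours / 24

print(f"{total_seconds} s = {hours:.0f} hours = {days:.1f} days")
```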


  The postings in the archive seldom change. In fact, they never change,
but some of them may be taken out of the archive for policy reasons from
time to time.

  It would be fine if htdig only accessed URLs it has not touched yet
(the postings new to the archive) and revisited already known ones only
after their expiration time.

  The url_log does not seem to be a solution to my problem. In the htdig
archive, I read about the 'revisit-after' META flag. As far as I can
tell, it's not implemented (yet?).

  I could use the url_list feature, rotating the lists for n days and
removing all URLs contained in all those lists from the HTML file I
mentioned above. This would be a work-around, but I want a more elegant
solution.
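
  A rough sketch of that work-around follows. It rests on several
assumptions: each rotated url_list file holds one URL per line, a URL
found in any of the kept lists counts as recently visited, and the
function names recently_visited and start_page are hypothetical.

```python
from html import escape
from pathlib import Path


def recently_visited(list_files):
    """Collect every URL found in the rotated url_list files.

    Assumes one URL per line; files that do not exist are skipped.
    """
    seen = set()
    for name in list_files:
        path = Path(name)
        if not path.exists():
            continue
        for line in path.read_text().splitlines():
            url = line.strip()
            if url:
                seen.add(url)
    return seen


def start_page(all_urls, visited):
    """Build the HTML start file, linking only URLs not visited recently."""
    links = [
        f'<a href="{escape(url)}">{escape(url)}</a><br>'
        for url in all_urls
        if url not in visited
    ]
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"
```

The list-gathering script would call recently_visited() on the last n
rotated lists and write the result of start_page() as the file htdig
starts from. Once a list rotates out after n days, its URLs show up in
the start file again.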

  Any suggestions?

  Oh, maybe I should mention some information about the system:
htdig version 3.1.6
Debian Sarge
Pentium 4 2.4 GHz processor
512 MiB RAM

Thanks in advance,
Christoph


