This is a good enhancement request, and we'll work on it for HtDig 4.0.

Thanks


On 4/21/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Christoph,
I'm afraid I can't see a single simple answer to what you are trying to
do, but I'm not 100% sure I understand you correctly.

If you are using the -a option, then you cannot possibly use the native
'If-Modified-Since:' function. However, it might be possible to create a
duplicate copy of your databases, either in a different folder, or with
a different prefix, and run an update dig on the offline set of
databases.
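
As a rough sketch of that idea (the paths and the name of the offline
config are purely illustrative; 'database_dir' is the attribute that
points htdig at a database folder):

    # Copy the live databases aside and run an update dig on the copy
    cp -a /var/lib/htdig/db /var/lib/htdig/db.offline
    # htdig.offline.conf = your normal config, except:
    #   database_dir: /var/lib/htdig/db.offline
    htdig -c /etc/htdig/htdig.offline.conf -v    # update dig: no -i
    htmerge -c /etc/htdig/htdig.offline.conf
    # swap db.offline into place once both steps succeed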

But, from what you describe, that may not help. The
'If-Modified-Since:' feature does require htdig to send a request to
the server, and from what you say this may still require the script to
fire up, if only to return a 'not-modified' header.
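
If the script start-up cost is the problem, one mitigation (a sketch
only, assuming a shell CGI behind Apache, which exports the header as
HTTP_IF_MODIFIED_SINCE; $posting stands for whatever file your script
serves) is to answer the conditional request before doing any
conversion work:

    # Reply 304 early if the client's date matches the file's mtime
    mtime=$(date -u -r "$posting" '+%a, %d %b %Y %H:%M:%S GMT')
    if [ "$HTTP_IF_MODIFIED_SINCE" = "$mtime" ]; then
        printf 'Status: 304 Not Modified\r\n\r\n'
        exit 0
    fi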

If you have the relevant coding expertise, then you can of course modify
the source code to do what you describe, but the simplest approach is
probably an update dig, on an offline copy of your databases, using a
URL list as your starting point. That list would then need to be
generated by an external script, possibly utilising an 'External Parser'
to note the Expires: header details.
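
As a crude illustration of the script side (the file names are made up,
and this scans headers with wget rather than using a true external
parser):

    # Keep only URLs whose Expires: date has passed, or that lack one
    now=$(date +%s)
    while read url; do
        exp=$(wget --spider --server-response "$url" 2>&1 |
              sed -n 's/^ *Expires: *//p' | head -1)
        if [ -z "$exp" ] || [ "$(date -d "$exp" +%s)" -le "$now" ]; then
            echo "$url"
        fi
    done < all_urls.txt > urls_to_dig.txt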

Hope that helps,
Mike

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf
> Of Christoph 'Mehdorn' Weber
> Sent: Thursday, April 20, 2006 10:05 PM
> To: [email protected]
> Subject: [htdig] Revisit pages only after Expire: date?
>
> Hello!
>
>   I'm trying to index a news archive with htdig. I currently use a
> simple CGI script to convert the postings (plain text files) to HTML
> pages. The script sets proper Expires: and Last-Modified: HTTP
> headers.
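>
> (For illustration only: headers of that shape can be produced from a
> shell CGI like this -- the 30-day lifetime and $posting are made up.)
>
>     printf 'Content-Type: text/html\r\n'
>     printf 'Last-Modified: %s\r\n' \
>         "$(date -u -r "$posting" '+%a, %d %b %Y %H:%M:%S GMT')"
>     printf 'Expires: %s\r\n\r\n' \
>         "$(date -u -d '+30 days' '+%a, %d %b %Y %H:%M:%S GMT')"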
>
>   Another script gathers all postings from the archive and creates an
> HTML file with links pointing to every posting through the converter
> CGI script. htdig uses this HTML file to start the dig, so it can
> easily find all postings.
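>
> (A sketch of such a generator, with made-up paths and CGI name:)
>
>     {
>       echo '<html><body>'
>       find /data/news-archive -type f -printf \
>         '<a href="http://server/cgi-bin/convert?post=%p">%f</a><br>\n'
>       echo '</body></html>'
>     } > postings.html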
>
>   When I run htdig on the list file, it seems to request every URL on
> every run. I already use a slightly modified version of the contributed
> rundig.sh script (the one that works with the -a switch on database
> copies, for indexing while maintaining the ability to search).
>
>   I want to tell htdig to revisit the URLs only after the 'Expires:'
> date it sent, instead of requesting them on every run with an
> 'If-Modified-Since:' header. The archive contains about 400,000
> postings, and every request will start up a new CGI script -- slowing
> down indexing and driving the server load far over 1.
>
>   Setting the server_wait_time to 1 fixes the load problem, but then
> it will take at least 400,000 seconds, or about 111 hours (nearly 5
> days), to index the archive. This means every posting added during the
> dig will only be indexed in the run after the current one, so new
> postings can only be found after 5 to 10 days. I don't think this is a
> good solution.
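>
> (For reference, that is this single line in the configuration file:)
>
>     server_wait_time: 1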
>
>
>   The postings in the archive seldom change. In fact, they
> never change,
> but some of them may be taken out of the archive for policy
> reasons from
> time to time.
>
>   It would be fine if htdig accessed only URLs it has not touched yet
> (the postings new to the archive), or revisited already known ones
> only after their expiration time.
>
>   The url_log seems not to be a solution to my problem. In the htdig
> list archive, I read about the 'revisit-after' META flag. As far as I
> could find out, it is not implemented (yet?).
>
>   I could use the url_list feature, rotating the lists for n days and
> removing all URLs contained in those lists from the HTML file I
> mentioned above. That would be a workaround, but I want a more elegant
> solution.
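>
> (Roughly like this, with made-up file names -- rotate, keep n = 7 days
> of lists, then filter the start page:)
>
>     mv url_list "url_list.$(date +%F)"           # rotate today's list
>     find . -name 'url_list.*' -mtime +7 -delete  # keep n = 7 days
>     cat url_list.* | sort -u > seen_urls.txt
>     # drop lines whose URL already appears in any kept list
>     grep -vFf seen_urls.txt postings.html > start_page.html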
>
>   Any suggestions?
>
>   Oh, maybe I should mention some information about the system:
> htdig version 3.1.6
> Debian Sarge
> Pentium 4 2.4 GHz processor
> 512 MiB RAM
>
> Thanks in advance,
> Christoph



_______________________________________________
ht://Dig general mailing list: <[email protected]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
