According to T.J. Yang:
> Geoff Hutchison wrote:
> > On Fri, 2 Mar 2001, T.J. Yang wrote:
> > > Is it possible to configure htdig to scan only the files
> > > modified since the last scan?
> > >
> > > Regularly scanning the same old documents seems to me a waste of
> > > time.
> >
> > Ah, but here's the catch. How do you know which documents have been
> > modified without querying the server? For the most part, htdig uses the
> > efficient If-Modified-Since header, which requires the server to only send
> > the document if it has changed.
>
> I am using htdig to index a large number of PDF/PS files under a
> directory. These PDF/PS files don't change much, but they are big
> files. I was hoping I could configure htdig to do a file date
> comparison against a local database containing the records of all
> the web pages/files last visited, before parsing the contents of
> the PDF/PS files.
>
> Perhaps this should be a wishlist feature for a future release of htdig.

This is currently the default behaviour of htdig. If it's not working
for you, there must be a configuration problem of some sort.

If your HTTP server correctly returns modification times for static
documents, htdig will record these in its database. When you re-run
htdig later, without the -i option, it will ask the server again for
these documents, but using the If-Modified-Since header so that the
server will only return the documents if they have been modified.
Even if the server ignores that header, htdig will check the returned
document's new Last-Modified header, and if this date is the same as
the one in the database, it will not bother reindexing the document
(although at this point it has already fetched the document).
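
To make the two checks above concrete, here is a rough sketch in Python
of the decision being described. This is not htdig's actual code. The
function name and parameters are made up for illustration, and the
"no date available" branch assumes the behaviour you get with
modification_time_is_now set to true.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_reindex(status: int,
                  last_modified: Optional[str],
                  stored: Optional[str]) -> bool:
    """Return True if a fetched document should be parsed and indexed again.

    status        -- HTTP status code of the response
    last_modified -- Last-Modified header of the response, if any
    stored        -- modification date recorded on the previous run, if any
    (All three names are hypothetical, chosen for this sketch.)
    """
    if status == 304:
        # The server honoured If-Modified-Since and sent no body.
        return False
    if last_modified is None or stored is None:
        # Nothing to compare against: assume the document may have changed.
        return True
    # The server ignored If-Modified-Since and sent the body anyway,
    # so fall back to comparing the two dates.
    return parsedate_to_datetime(last_modified) != parsedate_to_datetime(stored)
```

Only the first branch saves the transfer itself. The date comparison in
the last line saves the parsing, but the document has already come over
the wire by then.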

If htdig is giving you a lot of "retrieved but not changed" messages,
it is because your HTTP server is either ignoring the If-Modified-Since
header, or it's not returning Last-Modified headers and your htdig.conf
modification_time_is_now attribute is set to false. If htdig is simply
reporting "not changed", it is working correctly. If it is reporting
neither, chances are your HTTP server is not returning Last-Modified
headers and your htdig.conf modification_time_is_now attribute is set
to true. (When your server is not returning Last-Modified headers, this
latter behaviour is actually preferable to htdig assuming the document
is "retrieved but not changed", because at least you know you'll always
have the most up-to-date copy in the index.)

If your HTTP server is not returning Last-Modified headers, it's either
misconfigured, or it is treating your files as dynamic content for which
it cannot determine the modification date. Dynamic content would include
output from CGI programs, SSI (e.g. .shtml files), PHP, ASP and .cfm,
but it seems odd that it would treat PS/PDF files as dynamic content.

If you run htdig with the -vvv option, you'll see what headers are
passed to and from the server.
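
If you'd like to check the server's behaviour independently of htdig,
a small script along these lines should do. It is only a sketch using
Python's standard urllib. The function names are mine, and you would
substitute the URL of one of your own PDF/PS files.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_request(url: str, since_epoch: float) -> urllib.request.Request:
    """Build a GET request carrying an If-Modified-Since header."""
    stamp = formatdate(since_epoch, usegmt=True)  # HTTP date format, in GMT
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def probe(url: str, since_epoch: float):
    """Send the conditional GET; return (status, Last-Modified or None)."""
    req = build_request(url, since_epoch)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 304 Not Modified, which is the
        # response a well-behaved server gives for an unchanged file.
        return err.code, err.headers.get("Last-Modified")
```

Calling probe() with the current time as since_epoch on a file that has
not changed should give status 304 if the server honours the header. A
200 with no Last-Modified value would point to the misconfiguration
described above.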

If using local_urls is an option for you, you could side-step the HTTP
server entirely and make indexing much quicker and more efficient.

--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-dev