The server on which the pre-1970 documents solution works perfectly also uses local_urls, which makes it quite weird.
So if a new entry has 0 as its modified date and the actual file is dated 01/01/1970 or earlier, htdig would qualify it as not modified.
Is there any way to fix this?
I use a PHP script for parsing the PDF documents (it uses pdfinfo / pdftotext for the actual extraction) because it needs Oracle connectivity to retrieve the document's real title instead of the title stored in the PDF file.
It generates an HTML document, which in turn is read by htdig.
Would adding a last-modified entry make any difference, although we use local files only?
Regards,
Wim
According to Wim Kosten:
>> I've got this weird problem with ht://DIG 3.1.6
>>
>> I use htdig to index about 8000 PDF documents. In order to get the
>> correct ordering on dates (PDF files and normal HTML files) these files
>> are touched to a date as set in the database. So a document which
>> concerns a meeting of 2003-12-31 will be touched to that date.
>>
>> However, we have some documents which concern older stuff and the files
>> are touched to e.g. 1954-08-03.
>>
>> As these files are on the same server as htdig we use the local_urls
>> rewrite which works perfectly.
>>
>> However when reading the local files which have a date before 1970 htdig
>> seems to see them as "not changed". I'm curious how htdig finds out
>> about that while before indexing the complete db directory was emptied
>> so it seems to me there's no reference to the file.
Around line 750 in htdig/Document.cc, in version 3.1.6, is this code:

    modtime = stat_buf.st_mtime;
    if (modtime <= date)
        return Document_not_changed;

So, if the local file's modified time is less than or equal to the time in the database (which is 0, i.e. the UNIX epoch, 1 Jan 1970 00:00 UTC, for a new entry), it's treated as not changed. That test should probably be == instead of <=, or there should be an exception made when date == 0.
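
As a rough illustration of the "exception when date == 0" idea, here is a minimal standalone sketch. It is not a patch against Document.cc: the enum, both function names, and the sample timestamp are assumptions made for the example, and it assumes a signed time_t.

```cpp
#include <ctime>

// Hypothetical stand-ins for the status codes used in htdig/Document.cc.
enum DocStatus { Document_ok, Document_not_changed };

// The 3.1.6 test as quoted above: any mtime at or before the stored date
// is reported as unchanged, including pre-1970 files against a date of 0.
DocStatus check_original(std::time_t modtime, std::time_t date)
{
    if (modtime <= date)
        return Document_not_changed;
    return Document_ok;
}

// Sketch of the suggested exception: a stored date of 0 is taken to mean
// "new entry", so the comparison is skipped and the file gets indexed.
DocStatus check_with_exception(std::time_t modtime, std::time_t date)
{
    if (date != 0 && modtime <= date)
        return Document_not_changed;
    return Document_ok;
}
```

With a signed time_t, a file touched to 1954-08-03 00:00 UTC has an mtime of -486432000, so check_original(-486432000, 0) reports it as unchanged while check_with_exception(-486432000, 0) lets it through.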
>> Even stranger is the fact that I use this solution on another server and
>> it works perfectly. I can't find any weird settings in the configs which
>> would lead to a fix.
On the other server, are you also using local_urls, or indexing via HTTP? Via HTTP, htdig doesn't use the If-Modified-Since header unless the database date is > 0. If you are using local_urls, it could be the other server deals with negative modtimes differently, e.g. if time_t is unsigned.
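
To show why an unsigned time_t could hide the problem, here is a small sketch that runs the same <= comparison with a signed and an unsigned 64-bit timestamp type. Whether the other server really has an unsigned time_t is not known; the function names and the constant are made up for the example.

```cpp
#include <cstdint>

// The comparison from Document.cc, parameterised on the timestamp type so
// signed and unsigned behaviour can be compared side by side.
template <typename TimeT>
bool treated_as_unchanged(TimeT modtime, TimeT date)
{
    return modtime <= date;
}

// 1954-08-03 00:00 UTC expressed as seconds relative to the UNIX epoch.
const std::int64_t kMtime1954 = -486432000;

// Signed case: the negative mtime is <= 0, so the file looks unchanged.
bool unchanged_signed()
{
    return treated_as_unchanged<std::int64_t>(kMtime1954, 0);
}

// Unsigned case: converting the negative value wraps it to a very large
// positive number, which is > 0, so the file is indexed.
bool unchanged_unsigned()
{
    return treated_as_unchanged<std::uint64_t>(
        static_cast<std::uint64_t>(kMtime1954), 0);
}
```

Here unchanged_signed() returns true and unchanged_unsigned() returns false, which would match a setup where one server skips the pre-1970 files and the other indexes them.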
-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

-------------------------------------------------------
This SF.net email is sponsored by: The Robotic Monkeys at ThinkGeek
For a limited time only, get FREE Ground shipping on all orders of $35
or more. Hurry up and shop folks, this offer expires April 30th!
http://www.thinkgeek.com/freeshipping/?cpg=12297
_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
