RE: [htdig] pdf parser

Martin Vorlaender Mon, 13 Jan 2003 23:21:34 -0800

Robert Isaac <[EMAIL PROTECTED]> wrote (via email):
>>> I appear to have the pdf parser working partly. Some pdf files are
>>> indexed but most are not. I am using xpdf and doc2html programmes.
>>> Is there a reason why this should happen?
>>
>> Assuming you're talking about ht://Dig 3.1.x:
>>
>> The reason that gets most people is that any document bigger than
>> max_doc_size will not be retrieved completely, and thus not indexed
>> completely.
>>
>> See http://www.htdig.org/attrs.html#max_doc_size
>
> Thank you for your message. I had increased the max_doc_size 
> to 5000000, and it is ver 3.1.6. I have over 100 pdf files on
> the web site, and only 2 have been indexed during rundig.


Next I'd suggest you run rundig with multiple -v's and output
redirection, and have a look at any error messages in the logfile
generated.
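
As a rough sketch of that step: something like
"rundig -v -v -v > rundig.log 2>&1" should capture the verbose output,
and a few lines of Python can then pull out the lines that mention PDFs.
The logfile name rundig.log and the helper pdf_lines are placeholders of
mine, and the filter only matches on ".pdf" because I'm not assuming
anything about the exact log format.

```python
def pdf_lines(log_text):
    """Return the log lines that mention a .pdf URL or file.

    The match is case-insensitive and deliberately crude: it makes no
    assumption about how the log lines are structured.
    """
    return [line for line in log_text.splitlines() if ".pdf" in line.lower()]
```

Feeding it the contents of the logfile (e.g. open("rundig.log").read())
and printing the result should show which PDFs were visited at all, and
what was reported for them.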

The simplest error, of course, would be that the PDFs are not linked
from (i.e. reachable from) any of the start_url pages.
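
To check that quickly for a single page, a small sketch like the
following lists the PDF links the page actually contains. It uses only
Python's standard html.parser module; the names PdfLinkFinder and
pdf_links are made up for illustration, and it looks only at <a href>
attributes on one page, not at the whole crawl.

```python
from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Collect href values of <a> tags that point at .pdf files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Ignore any fragment or query string when checking the extension.
                path = value.split("#", 1)[0].split("?", 1)[0]
                if path.lower().endswith(".pdf"):
                    self.links.append(value)


def pdf_links(html_text):
    """Return all PDF hrefs found in the given HTML text, in document order."""
    finder = PdfLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.links
```

If a start page and the pages it links to yield no PDF links at all,
htdig has no way of finding the files.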

Also, I seem to remember a note (in the sources?) that xpdf wouldn't
work. Could someone else please chime in here?

cu,
  Martin
-- 
                           | Martin Vorlaender       VMS/WNT programmer
 Unix is user friendly.    | work: [EMAIL PROTECTED]
 It's just selective about |   http://www.pdv-systeme.de/users/martinv/
 who his friends are.      | home: [EMAIL PROTECTED]

