Re: [htdig] doc2html.pl version 3 problems under windows NT

Gilles Detillieux Mon, 18 Jun 2001 11:58:19 -0700
According to Marcus Valentine:
> Now htdig runs, with no errors when it encounters a pdf.  For example
> 
> 15:15:1:http://marcusv_pc:8080/toracomm/pdf/DS012_Design_Services.pdf:
> size = 69129
> 
> But when I run htmerge, I get for example 
> 
> Deleted, no excerpt:
> 15/http://marcusv_pc:8080/toracomm/pdf/DS012_Design_Services.pdf
> 
> Is the pdf being indexed or not?  Anyone got any ideas?

If a file is deleted because of "no excerpt", there are a few possible
reasons.  One is that the file contains no indexable text.  Try
running doc2html.pl on it directly and see what comes out.  If it looks
like a valid HTML file with lots of text, then it could be another reason.
If it doesn't have text, try running pdftotext on it directly, to confirm
the file has no text (as opposed to a problem with the doc2html.pl script).
Some PDFs look like they contain text when you view them, but what you
see is really just images of text, not text that can be extracted.
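As a rough sketch of that check, the Python below runs pdftotext on a
file and counts word-like runs in the output.  The helper names, the
3-character minimum, and the ASCII-only word pattern are assumptions
made for illustration; htdig's own word rules come from its config and
may differ.  It assumes pdftotext is on the PATH and uses "-" as the
output file so the text goes to stdout.

```python
import re
import subprocess


def count_indexable_words(text, min_length=3):
    """Count runs of at least min_length ASCII letters or digits.

    min_length=3 is an illustrative assumption, not htdig's exact rule.
    """
    return len(re.findall(r"[A-Za-z0-9]{%d,}" % min_length, text))


def looks_indexable(text):
    """True if the extracted text has at least one word-like run."""
    return count_indexable_words(text) > 0


def pdf_text(pdf_path):
    """Return whatever pdftotext extracts from pdf_path.

    Passing '-' as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # The output encoding varies between builds; undecodable bytes are
    # replaced, which does not affect the ASCII word count.
    return result.stdout.decode("utf-8", errors="replace")
```

For example, looks_indexable(pdf_text("DS012_Design_Services.pdf"))
coming back False would point at the PDF itself (image-only pages)
rather than at the doc2html.pl script.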

Other reasons include the file being disallowed by robots.txt, containing
a robots meta tag that turns off indexing, or hitting the server_max_docs
limit.  I don't think any of these apply here, given that doc2html
doesn't emit robots meta tags and the "size =" line indicates the
document was actually retrieved.
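If you want to rule out robots.txt anyway, a quick approximation is to
feed the server's rules to Python's standard robots.txt parser.  This
is only a sketch: the helper name and the sample rules are made up for
illustration, the "htdig" agent name is an assumption about your
robotstxt_name setting, and Python's parser is not htdig's, so treat
the answer as a hint rather than proof.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, url, agent="htdig"):
    """Check url against robots.txt rules given as a list of lines.

    agent="htdig" is an assumed user-agent name; use whatever your
    configuration actually sends.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical rules, not taken from the server in question.
sample_rules = ["User-agent: *", "Disallow: /toracomm/pdf/"]
pdf_url = "http://marcusv_pc:8080/toracomm/pdf/DS012_Design_Services.pdf"
blocked = not allowed_by_robots(sample_rules, pdf_url)
```

With the sample rules above the PDF URL is disallowed; with the real
robots.txt contents in place of sample_rules you would expect it to be
allowed if robots.txt is not the cause.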

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html