Re: [htdig] pdf indexing question

Gilles Detillieux Tue, 25 Jul 2000 09:23:38 -0700
According to Matthew R. MacIntyre:
> I'm having a problem indexing pdf files.  The htdig phase seems to work
> fine, no errors are produced, but when the htmerge phase is run, this error
> always shows up:
> 
> Deleted, no excerpt: 17/http://svr-newlix/products/technical/faq.pdf
> 
> I'm not really sure how to go about fixing this problem.  Here's what I have
> in my configuration file:
> 
> external_parsers: application/msword->text/html /usr/local/htdig/bin/conv_doc.pl \
>                application/postscript->text/html /usr/local/htdig/bin/conv_doc.pl \
>                application/pdf->text/html /usr/local/htdig/bin/conv_doc.pl
> 
> I was trying to use the parse_doc.pl script instead of the conv_doc.pl
> script for a little while, but I kept getting many errors about acroread not
> showing up, and how the pdf files could not be repaired.

Looks like you're dealing with a few separate problems here.

Errors about acroread not being found shouldn't happen if you properly
configure an external parser or converter for application/pdf, so you
had a configuration error somewhere when trying to use parse_doc.pl.
As long as you're running 3.1.4 or later, you should use conv_doc.pl or
doc2html.pl, rather than parse_doc.pl -- they just work better.

Also, errors about PDF files that couldn't be repaired would come from
acroread as well.  These are usually caused by max_doc_size not being
set high enough for your largest PDF documents, so htdig truncates the
file before the parser ever sees it.  See FAQ 5.1 & 5.2.
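
To pick a sensible max_doc_size, it helps to know how big your largest
PDF actually is.  Here is a minimal sketch, assuming the PDFs are also
reachable on the local filesystem; the function name largest_pdf_bytes
and the DOCROOT variable are made up for illustration and are not part
of ht://Dig:

```shell
#!/bin/sh
# Hypothetical helper: print the size in bytes of the largest *.pdf
# file found under the directory given as the first argument.
largest_pdf_bytes() {
    find "$1" -type f -name '*.pdf' -exec wc -c {} + 2>/dev/null |
        awk '$2 != "total" && $1 > max { max = $1 } END { print max + 0 }'
}

# DOCROOT is an assumed variable pointing at your web document root.
largest_pdf_bytes "${DOCROOT:-.}"
```

Whatever number this prints, set max_doc_size in your configuration
file to something comfortably above it, for example a line such as
"max_doc_size: 5000000" (that value is only an illustration), and then
reindex.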

Finally, you should run /usr/local/htdig/bin/conv_doc.pl, and perhaps
pdftotext, manually on your products/technical/faq.pdf document to
see what output you get, if any.  It may be that the PDF contains only
image data, and no indexable text, or it may be that conv_doc.pl isn't
configured with the right path to the pdftotext executable.
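
A manual test along those lines could look roughly like the sketch
below.  It assumes a local copy of the PDF named faq.pdf, that
pdftotext is on your PATH, and that your configuration file lives at
/usr/local/htdig/conf/htdig.conf (adjust all three to your setup).  The
four arguments passed to conv_doc.pl follow the usual external parser
calling convention of input file, content type, URL and configuration
file; check the comments at the top of your copy of the script if it
differs.  count_text_chars is a made-up helper, not part of ht://Dig:

```shell
#!/bin/sh
# Hypothetical helper: count the non-whitespace characters on stdin.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

PDF=${PDF:-faq.pdf}                      # assumed local copy of the PDF
CONV=/usr/local/htdig/bin/conv_doc.pl    # path taken from your config
CONF=/usr/local/htdig/conf/htdig.conf    # assumed config file location

# Step 1: does pdftotext extract any text at all?
# ("-" as the output file sends the text to stdout.)
if command -v pdftotext >/dev/null 2>&1 && [ -f "$PDF" ]; then
    echo "pdftotext: $(pdftotext "$PDF" - | count_text_chars) non-blank characters"
fi

# Step 2: run the converter roughly the way htdig would call it.
if [ -x "$CONV" ] && [ -f "$PDF" ]; then
    "$CONV" "$PDF" application/pdf \
        http://svr-newlix/products/technical/faq.pdf "$CONF" | head -40
fi
```

If step 1 reports zero characters, the PDF most likely holds only image
data.  If step 1 finds text but step 2 prints nothing, look at the
pdftotext path configured inside conv_doc.pl.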

I'm assuming the first two lines of your external_parsers definition
above were split up by your mail program (I rejoined them above), and
they aren't split in your configuration file.  A backslash is required
at the very end of all but the last line in a multi-line definition.

If you can make sure that your external_parsers definition is correct,
that max_doc_size is big enough for your PDFs, that running conv_doc.pl
on your PDFs does produce indexable text, and that the PDFs are not
disallowed by your robots.txt file, then you shouldn't get the no excerpt
error above.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
