PDF files, the never-ending story...
First, check whether the web server sends the right Content-Type header for
PDF files. You can test this by downloading a file with your normal browser
and opening it with Acrobat Reader. If that does not work, insert
application/pdf pdf
in the server's MIME types file.
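Both points can also be checked from a shell. This is only a rough sketch:
has_pdf_mimetype is a hypothetical helper name, and the URL and the path
/etc/mime.types in the example calls are placeholders that will differ on
your system.

```shell
# Hypothetical helper: succeeds if the given MIME types file maps
# the extension "pdf" to application/pdf (commented-out lines are ignored).
has_pdf_mimetype() {
    grep -Eq '^[[:space:]]*application/pdf[[:space:]]+([^#]*[[:space:]])?pdf([[:space:]]|$)' "$1"
}

# Example calls (placeholders, adjust to your own server):
#   curl -sI http://your.server/some/file.pdf | grep -i '^content-type'
#   has_pdf_mimetype /etc/mime.types || echo "application/pdf entry is missing"
```

The curl call prints only the response headers, so you see the Content-Type
without opening the file in a browser.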
Once this works, edit the htdig.conf file and insert the following line:
pdf_parser: /usr/local/Acrobat4/bin/acroread -toPostScript -pairs
I use Acrobat Reader (version 4.0) rather than the xpdf tool to parse PDF
documents with htdig. You can download the software from www.adobe.com. Before
digging again, test this at the command line:
/path_to/acroread -toPostScript input_file.pdf
You should get a file named input_file.ps.
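To confirm that the conversion really produced PostScript and not an empty or
broken file, a small check like the following may help. It is a sketch only:
looks_like_postscript is a hypothetical helper name, and it relies on the fact
that PostScript files normally begin with "%!PS".

```shell
# Hypothetical helper: succeeds if the file is non-empty and
# starts with the usual PostScript signature "%!PS".
looks_like_postscript() {
    [ -s "$1" ] && [ "$(head -c 4 "$1")" = "%!PS" ]
}

# Example call, after running the conversion by hand:
#   looks_like_postscript input_file.ps && echo "conversion looks fine"
```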
When running htdig, make sure you have enough disk space in /tmp.
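A quick way to look at the free space before a dig, as a sketch: free_kb is a
hypothetical helper name, and the 100 MB threshold is an arbitrary assumption,
since how much you need depends on the size of your PDF files.

```shell
# Hypothetical helper: print the free space (in KB) of the
# filesystem that holds the given directory.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Warn if /tmp has less than about 100 MB free (assumed threshold).
if [ "$(free_kb /tmp)" -lt 102400 ]; then
    echo "warning: less than 100 MB free in /tmp" >&2
fi
```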
"Joakim Wiberg (HMS)" wrote:
> Hello,
>
> I try to index a PDF file and I get the following error.
>
> Read 8192 from document
> Read 1696 from document
> Read a total of 100000 bytes
> Can't determine type of file /usr/local/htdig/db/htdext.16478; content-type:
> application/pdf; URL: http://10.10.12.67/comp/datasheet/K00117.pdf
>
> I can get htdig to index common html pages, but when I try to index PDF
> files this problem arises.
>
> /Joakim
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> [EMAIL PROTECTED] containing the single word "unsubscribe" in
> the SUBJECT of the message.
--
Key fingerprint = 92 7D E0 A6 CF AE EC 32 14 28 EF 0D 57 2A 88 5B
----------------------------------------------------------------------
Preussag Noell Dienstleistungen
D-97080 Wuerzburg
Alfred-Nobel-Straße 20 Tel: +49 931 903-2243
Abt: DV-C/tr Fax: +49 511 903-2051