According to Joakim Wiberg (HMS):
> I try to index a PDF file and I get the following error.
>
> Read 8192 from document
> Read 1696 from document
> Read a total of 100000 bytes
> Can't determine type of file /usr/local/htdig/db/htdext.16478; content-type:
> application/pdf; URL: http://10.10.12.67/comp/datasheet/K00117.pdf
>
> I can get htdig to index common html pages, but when I try to index PDF
> files this problem arises.
I can see a couple of problems here. First of all, unless your K00117.pdf
is exactly 100000 bytes in length, it's being truncated. You'll likely
need to boost your max_doc_size attribute to something larger than your
biggest PDF to avoid truncation.
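
One rough way to pick a value is to measure the largest PDF the indexer will
fetch. This is a minimal Python sketch, assuming the PDFs are also reachable
on the local filesystem of the web server; the function name
largest_pdf_size is hypothetical and not part of ht://Dig.

```python
import os


def largest_pdf_size(root):
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match on the extension only; this is an assumption about naming.
            if name.lower().endswith(".pdf"):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest
```

If this reported, say, 3400000 bytes, a line such as "max_doc_size: 5000000"
in htdig.conf would leave some headroom (the figures are illustrative only).
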
Secondly, the error message above seems to come from the parse_doc.pl
script. For some reason, your PDF does not have a magic number that the
script recognises, so it's rejecting it. Try running pdftotext on it
directly, to see if pdftotext can handle it. If that works, there's a
discrepancy between what pdftotext and parse_doc.pl recognise as a valid
PDF, and I'd probably need a sample of such a PDF to fix the problem
in parse_doc.pl. If pdftotext can't handle K00117.pdf directly, you're
not going to be able to index it in any case -- not with this external
parser anyway. In this case, you'll need to see what the problem is.
If acroread can handle the PDF, and pdftotext can't, I guess it's Derek
Noonburg's problem. Is the PDF encrypted or encoded in some way or other?
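
The two checks above can be sketched in Python. This is only an illustration,
not what parse_doc.pl actually does: it assumes the magic number in question
is the standard "%PDF-" header, and the "leading-junk" case (header present
but not at offset 0) is just one guess at how a strict check and a lenient
reader might disagree. The function names are hypothetical; the pdftotext
call uses its usual "pdftotext file -" form, which writes the text to stdout.

```python
import shutil
import subprocess

PDF_MAGIC = b"%PDF-"


def pdf_magic_offset(data, window=1024):
    """Return the offset of '%PDF-' within the first `window` bytes, or -1."""
    return data[:window].find(PDF_MAGIC)


def classify_header(data):
    """Classify the start of a file as 'ok', 'leading-junk' or 'missing'."""
    offset = pdf_magic_offset(data)
    if offset == 0:
        return "ok"            # header is the very first thing in the file
    if offset > 0:
        return "leading-junk"  # header exists but something precedes it
    return "missing"           # no PDF header near the start of the file


def pdftotext_ok(path):
    """Run pdftotext on path and discard its output.

    Returns True on exit status 0, False otherwise, or None if pdftotext
    is not installed on this machine.
    """
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        ["pdftotext", path, "-"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Reading the first kilobyte of K00117.pdf and passing it to classify_header,
then calling pdftotext_ok on the same file, would show which of the cases
described above applies.
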
Gilles
P.S. I'll be away tomorrow, so I probably won't get to look into this
further until Monday.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930