Re: [htdig] Problem indexing PDF files

Gilles Detillieux Thu, 8 Jul 1999 09:10:26 -0700

According to Joakim Wiberg (HMS):
> I try to index a PDF file and I get the following error.
> 
> Read 8192 from document
> Read 1696 from document
> Read a total of 100000 bytes
> Can't determine type of file /usr/local/htdig/db/htdext.16478; content-type:
> application/pdf; URL: http://10.10.12.67/comp/datasheet/K00117.pdf
> 
> I can get htdig to index common html pages, but when I try to index PDF
> files this problem arises.

I can see a couple of problems here.  First of all, unless your K00117.pdf
is exactly 100000 bytes in length, it's being truncated.  You'll likely
need to boost your max_doc_size attribute to something larger than your
biggest PDF to avoid truncation.
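
As a quick way to spot which PDFs would be cut off, here is a minimal
Python sketch.  It is not part of ht://Dig; the function name is made up
for illustration, and the 100000 default is just the byte count from the
log above, which I'm assuming matches the max_doc_size in effect.

```python
import os


def would_truncate(path, max_doc_size=100000):
    """Return True if the file is larger than max_doc_size bytes.

    Hypothetical helper: max_doc_size here mirrors the ht://Dig attribute
    of the same name, with the default assumed from the log output.
    """
    return os.path.getsize(path) > max_doc_size
```

Any file for which this returns True is larger than the limit, so
max_doc_size would need to be raised above that file's size.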

Secondly, the error message above seems to come from the parse_doc.pl
script.  For some reason, your PDF does not have a magic number that the
script recognises, so it's rejecting it.  Try running pdftotext on it
directly, to see if pdftotext can handle it.  If that works, there's a
discrepancy between what pdftotext and parse_doc.pl recognise as a valid
PDF, and I'd probably need a sample of such a PDF to fix the problem
in parse_doc.pl.  If pdftotext can't handle K00117.pdf directly, you're
not going to be able to index it in any case -- not with this external
parser anyway.  In this case, you'll need to see what the problem is.
If acroread can handle the PDF, and pdftotext can't, I guess it's Derek
Noonburg's problem (he's the author of pdftotext).  Is the PDF encrypted
or encoded in some way or other?
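
To check the magic number by hand, something like the following Python
sketch may help.  It is only an illustration, not what parse_doc.pl
actually does: I'm assuming the relevant test is for the standard "%PDF-"
header, and the function name and the 1024-byte scan window are my own
choices.

```python
def pdf_header_offset(path, scan_bytes=1024):
    """Return the byte offset of the "%PDF-" header, or -1 if not found.

    Only the first scan_bytes bytes are examined.  An offset of 0 means
    the header is at the very start of the file; a positive offset means
    there is leading junk before it, which a strict check would reject.
    """
    with open(path, "rb") as f:
        head = f.read(scan_bytes)
    return head.find(b"%PDF-")
```

A result of 0 suggests the file starts like a normal PDF.  A positive
result means some bytes precede the header, which could explain one tool
accepting the file while another rejects it.  A result of -1 means the
header wasn't found in the part that was scanned.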

Gilles

P.S.  I'll be away tomorrow, so I probably won't get to look into this
further until Monday.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930