According to Derek B. Noonburg:
> I checked this first one.  Xpdf 0.80 doesn't have any trouble
> displaying it (on my Linux box).  It's just scanned images, one per
> page, so pdftotext isn't going to get anything.
> 
> I'm planning to add PDF 1.3 support, but it doesn't look like there are
> too many major differences, so xpdf 0.80 should do ok for now.
> 
> As for the 'file is damaged' error, maybe you got a bad file download? 
> For example, I've seen cases where (flaky) web servers die in the middle
> of a transfer, with no visible error.  Your error message is consistent
> with what I'd expect for a truncated PDF file.

D'oh!  The "file is damaged" error should have jogged my memory.  It's
come up before, but I got thrown off track by the version number issue.

The max_doc_size attribute tells htdig the upper limit, in bytes, on
the size of documents it fetches.  Anything beyond that gets truncated!
This works OK for HTML documents, but a truncated PDF is unusable.
The default max_doc_size is 100000 bytes.  When indexing PDFs, it
should be raised substantially, so that it's at least as large as the
largest PDF you will index.  If you can't afford to make it that large
because of memory constraints, you need to explicitly exclude larger
PDFs from indexing, e.g. by listing them in Disallow records in your
robots.txt file.
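
If it helps, here is a rough sketch of how you might generate those
Disallow records rather than listing them by hand.  It is only an
illustration, not part of htdig: the function names are made up, it
assumes the PDFs sit on local disk under a document root whose layout
maps directly onto URL paths, and it assumes the robot name "htdig"
(adjust it if your robotstxt_name is set to something else).

```python
import os


def oversized_pdfs(doc_root, max_doc_size):
    """Return sorted URL paths of PDFs under doc_root larger than max_doc_size bytes.

    Assumption: a file at doc_root/a/b.pdf is served as /a/b.pdf.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            full_path = os.path.join(dirpath, name)
            # Anything over the limit would be truncated on fetch.
            if os.path.getsize(full_path) > max_doc_size:
                rel_path = os.path.relpath(full_path, doc_root)
                found.append("/" + rel_path.replace(os.sep, "/"))
    return sorted(found)


def robots_records(paths, user_agent="htdig"):
    """Format one robots.txt record group excluding the given URL paths.

    Assumption: "htdig" is the name the indexer matches in robots.txt.
    """
    lines = ["User-agent: " + user_agent]
    lines.extend("Disallow: " + path for path in paths)
    return "\n".join(lines)
```

You would call something like
robots_records(oversized_pdfs("/path/to/docroot", 100000)) with the same
value you use for max_doc_size, and paste the result into robots.txt.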

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.
