>
> I just discovered that max_doc_size is different from max_head_length.
> Furthermore the default for max_doc_size is 100K (defaults.cc). This is
> fine except when indexing large PDF files. The problem is that the
> error message is not correct. I got many errors like this (with only
> one -v):
>
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
>
> Repeated many times...
>
> 'Could not repair file' made me think that there was a problem with some
> of my pdf files or with my acroread program. However, the error should
> have said something like this:
>
> Document.cc: /tmp/htdig29740.ps: file is too large
<SNIP>
Hi,
I have just had to solve this problem myself. As I understand it, the 'Could
not repair file' message comes from Acroread. The sequence of events is
something like this (someone correct me if I am wrong :-):
1. htdig copies no more than max_doc_size bytes of the .pdf file to a
   temporary file.
2. htdig then starts acroread and passes it the temporary file name.
3. If the file was bigger than max_doc_size bytes, Acroread encounters an
   unexpected EOF and assumes that the .pdf file is corrupt, hence the
   'Could not repair file' error.
4. Acroread returns with an error.
5. htdig reports the 'cannot open acroread output' error.
It was a little confusing to start with, but it is dealt with in the FAQ.
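
For anyone who wants to see the mechanism in isolation, here is a minimal
Python sketch of the truncation step. It is not htdig's code: the function
names, the value 100000 standing in for the "100K" default, and the 1024-byte
tail window are all my own assumptions for illustration. It only shows how a
size-limited copy loses the end of a PDF, and how the condition could be
reported as "file is too large" instead.

```python
import os
import tempfile

# Assumption: 100000 stands in for the "100K" default of max_doc_size.
MAX_DOC_SIZE = 100000


def copy_truncated(data, max_doc_size=MAX_DOC_SIZE):
    """Write at most max_doc_size bytes of data to a temporary file.

    Returns (path, truncated); truncated is True if bytes were dropped.
    Hypothetical helper, not an htdig function.
    """
    truncated = len(data) > max_doc_size
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data[:max_doc_size])
    return path, truncated


def has_eof_marker(path, tail_bytes=1024):
    """Return True if the PDF '%%EOF' marker appears near the end of the file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        return b"%%EOF" in f.read()


def describe(path, truncated):
    """Build the clearer diagnostic suggested in the quoted message."""
    if truncated:
        return "%s: file is too large" % path
    return "%s: ok" % path
```

A document larger than the limit comes out of copy_truncated() without its
trailing '%%EOF' marker, which is consistent with a PDF reader treating the
temporary file as damaged. Checking the truncated flag before handing the
file to the converter is what would let the indexer print the clearer
message.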
Fare Thee Well
Anthony Peacock
CHIME, UCL Medical School
E-Mail: [EMAIL PROTECTED]
WWW: http://www.chime.ucl.ac.uk/~rmhiajp/