> 
> I just discovered that max_doc_size is different from max_head_length. 
> Furthermore the default for max_doc_size is 100K (defaults.cc).  This is
> fine except when indexing large PDF files.  The problem is that the
> error message is not correct.  I got many errors like this (with only
> one -v):
> 
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> 
> Repeated many times...
> 
> 'Could not repair file' made me think that there was a problem with some
> of my pdf files or with my acroread program.  However, the error should
> have said something like this:
> 
> Document.cc: /tmp/htdig29740.ps: file is too large

<SNIP>

Hi,

I have just had to solve this problem myself.  As I understand it, the
'Could not repair file' message is coming from Acroread.  The sequence of
events is something like this (someone correct me if I am wrong :-):

1  htdig copies no more than max_doc_size bytes of the .pdf file to a
   temporary file

2  htdig then fires up acroread and passes the temporary file name

3  If the file was bigger than max_doc_size bytes, Acroread encounters an
   unexpected EOF and assumes that the .pdf file is corrupt, hence the
   'Could not repair file' error

4  Acroread returns with an error

5  htdig reports the 'cannot open acroread output' error
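
For anyone who wants to see the truncation in isolation, here is a minimal
Python sketch of steps 1 and 3.  It is not htdig's actual code; the names
copy_truncated and describe are made up for illustration, and the 100000-byte
default only mirrors the 100K figure quoted above.  It copies at most
max_doc_size bytes to a temporary file and notes whether anything was cut
off, which is the point where a "file is too large" message could be
produced instead of leaving Acroread to complain:

```python
import os
import tempfile

# Assumption: mirrors the 100K default mentioned in the quoted message.
MAX_DOC_SIZE = 100_000


def copy_truncated(src_path, max_doc_size=MAX_DOC_SIZE):
    """Copy at most max_doc_size bytes of src_path to a temporary file.

    Returns (temp_path, truncated). Hypothetical helper, not htdig code.
    """
    with open(src_path, "rb") as src:
        # Read one byte more than the limit so an oversized file is detectable.
        data = src.read(max_doc_size + 1)
    truncated = len(data) > max_doc_size

    fd, temp_path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data[:max_doc_size])
    return temp_path, truncated


def describe(src_path, max_doc_size=MAX_DOC_SIZE):
    """Return the kind of message the original poster asked for."""
    temp_path, truncated = copy_truncated(src_path, max_doc_size)
    try:
        if truncated:
            return f"{src_path}: file is too large (limit {max_doc_size} bytes)"
        return f"{src_path}: ok"
    finally:
        os.remove(temp_path)
```

Calling describe() on a PDF larger than the limit reports the size problem
directly; a truncated copy handed to a PDF reader would simply look like a
corrupt file, which matches the behaviour described above.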

It was a little confusing to start with, but it is dealt with in the FAQ.



Fare Thee Well
Anthony Peacock       
CHIME, UCL Medical School
E-Mail: [EMAIL PROTECTED]
WWW:    http://www.chime.ucl.ac.uk/~rmhiajp/
