I just discovered that max_doc_size is not the same thing as
max_head_length.  Furthermore, the default for max_doc_size is 100K
(defaults.cc).  This is fine except when indexing large PDF files.  The
problem is that the resulting error message is misleading.  I got many
errors like this (with only one -v):

/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output

Repeated many times...

'Could not repair file' made me think that there was a problem with some
of my PDF files or with my acroread program.  However, the error should
have said something like this:

Document.cc: /tmp/htdig29740.ps: file is too large
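
To make the suggestion concrete, here is a minimal C++ sketch of the kind
of check that could produce such a message.  The function name
sizeWarning and its parameters are hypothetical and are not taken from
the ht://Dig source; it only assumes that the caller knows the document's
full size and the configured max_doc_size.

```cpp
#include <string>

// Hypothetical helper, not part of ht://Dig: builds the diagnostic
// suggested above when a document is larger than max_doc_size.
// Returns an empty string when the document fits within the limit.
std::string sizeWarning(const std::string &path, long docSize, long maxDocSize)
{
    if (docSize <= maxDocSize)
        return "";

    return "Document.cc: " + path + ": file is too large (" +
           std::to_string(docSize) + " bytes, max_doc_size is " +
           std::to_string(maxDocSize) + ")";
}
```

Called as sizeWarning("/tmp/htdig21961.pdf", 1998848, 100000), this
returns "Document.cc: /tmp/htdig21961.pdf: file is too large (1998848
bytes, max_doc_size is 100000)".  The limit of 100000 assumes that "100K"
means 100000 bytes.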

Here is a sample of the debugging output (several -v's) that
demonstrates what is displayed when a document is truncated:

8:8:1:http://www.et.byu.edu/caedm/software/misc/undergrad_cat.pdf:
/tmp/htdig21961.pdf: Could not repair file.
PDF::parse: cannot open acroread output
 size = 1998848
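
The size reported here (1998848 bytes) is well above the 100K default, so
one possible workaround is to raise max_doc_size in the configuration
file.  The fragment below is only a sketch: it assumes the usual
htdig.conf attribute syntax, and the value 5000000 is an arbitrary
example rather than a recommendation.

```
# htdig.conf fragment (example value, in bytes)
max_doc_size: 5000000
```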

I believe that PDF::parse still indexes the first part of the file when
it is too long.  I was unable to locate what is generating the 'Could not
repair file' message.  Since PDF::parse is not the problem here, perhaps
this message should not be displayed with only one -v.  Unless I missed
something...

Gordon