On Wed, 03 Jan 2001 15:59:57 +0100 Berthold Cogel 
<[EMAIL PROTECTED]> wrote:

> Hello!
> 
> I just tried to index our site with htdig-3.1.5 on a Sun UltraSparc with
> SunOS 5.7.
> To parse PDF documents I used doc2html and pdftotext. My first mistake
> was to leave max_doc_size at the default value. But I don't think that
> this was the reason for my problem:
> 
> Sometimes doc2html hangs and eats resources and produces an unknown child
> process with <defunct> signature in the top list (perhaps pdftotext?).
> 

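There is a known bug in the hyphenation code in doc2html.pl 
that makes it loop indefinitely when parsing a PDF file 
whose last character is a hyphen.  At end of file <CAT> 
returns nothing, so the expression ($_ .= <CAT>) is still 
true, the "last" is never reached, and the inner loop never 
exits.  This seems unlikely, but I have seen it.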

In sub try_text change:

      while (<CAT>) {
        while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
          ($_ .= <CAT>) || last;
          s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        s/\255/-/g;     # replace dashes with hyphens

To:

     while (<CAT>) {
       while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
         $_ .= <CAT>;
         last if eof;
         s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
       }
       s/\255/-/g;     # replace dashes with hyphens
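
For anyone who wants to try the idea outside doc2html.pl, here 
is a minimal stand-alone sketch.  It is not the doc2html.pl 
code: the name join_hyphens and the in-memory filehandle are 
my own assumptions for illustration.  It tests whether the 
read returned anything before appending, so a continuation on 
the very last line is still joined and a trailing hyphen 
simply ends the loop.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of doc2html.pl: joins words that are
# hyphenated across line breaks and returns the resulting text.
sub join_hyphens {
    my ($text) = @_;

    # Read the string through an in-memory filehandle, line by line.
    open(my $cat, '<', \$text) or die "cannot open in-memory file: $!";

    my $out = '';
    while (defined(my $line = <$cat>)) {
        # Line ends in letter + hyphen: try to pull in the next line.
        while ($line =~ m/[A-Za-z\300-\377]-\s*$/) {
            my $next = <$cat>;
            last unless defined $next;    # end of input: stop, do not loop
            $line .= $next;
            $line =~ s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        $out .= $line;
    }
    close($cat);
    return $out;
}

print join_hyphens("an exam-\nple line\nending with a hy-\n");
```

Each pass of the inner loop either consumes one more line or 
leaves the loop, so it cannot spin on a hyphen at end of file.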

> I don't think that the document size is a reason for this effect,
> because some of the files that caused the trouble (last line in
> htdig.log) had a size of only 10 to 40 KByte. Some bigger files (up to
> 34 MByte) didn't stop doc2html. 
> 
> By the way: Where do I have to set $Verbose?

It is set in sub init:

sub init {

  # set = 1 for O/P on stderr if successful
  $Verbose = 1;

> Is it possible to write the
> messages of pdftotext and doc2html in a separate logfile?
> 

Perhaps in the next version of doc2html.

> Why doesn't htdig/doc2html take the complete document for parsing? You
> only have to take max_doc_size into account when you take the parsed
> documents for indexing. This might reduce the problems with doctypes
> other than html or plain text.

max_doc_size affects all documents fetched by htdig.  It is 
a safety device to prevent the downloading of extremely 
large (or infinitely long!) documents.
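
As an illustration only: the limit is an attribute in the 
htdig configuration file, given in bytes.  The value below is 
an arbitrary example, not a recommendation; pick one that 
covers the largest document you actually want parsed.

```
max_doc_size: 5000000
```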

> 
> Thanks in advance
> 
> Berthold Cogel
> 
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> [EMAIL PROTECTED]
> You will receive a message to confirm this.
> List archives:  <http://www.htdig.org/mail/menu.html>
> FAQ:            <http://www.htdig.org/FAQ.html>
> 

----------------------
David Adams
[EMAIL PROTECTED]

