On Wed, 03 Jan 2001 15:59:57 +0100 Berthold Cogel
<[EMAIL PROTECTED]> wrote:
> Hello!
>
> I just tried to index our site with htdig-3.1.5 on a Sun UltraSparc with
> SunOS 5.7.
> To parse PDF documents I used doc2html and pdftotext. My first mistake
> was to leave max_doc_size at the default value. But I don't think that
> this was the reason for my problem:
>
> Sometimes doc2html hangs, eats resources, and produces an unknown child
> process with a <defunct> signature in the top list (perhaps pdftotext?).
>
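There is a known bug in the hyphenation code in doc2html.pl
which causes it to loop indefinitely when the text of a .PDF
file ends with a hyphen. This seems unlikely, but I have seen it.
At end of file <CAT> returns nothing, so ($_ .= <CAT>) is still
true, "last" is never reached, and the inner loop never exits.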
In sub try_text change:

    while (<CAT>) {
        while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
            ($_ .= <CAT>) || last;
            s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        s/\255/-/g; # replace dashes with hyphens

To:

    while (<CAT>) {
        while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) {
            $_ .= <CAT>;
            last if eof;
            s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        s/\255/-/g; # replace dashes with hyphens
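
To try the loop outside doc2html.pl, here is a minimal standalone sketch.
The sub join_hyphenated and the sample text are made up for illustration
and are not part of doc2html.pl. It also tests the read with "defined"
rather than using the "eof" check from the patch above, so a final
hyphenated pair is still joined before the loop stops.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of doc2html.pl: joins words that are
# hyphenated across line breaks and stops cleanly at end of input.
sub join_hyphenated {
    my ($fh, $hyph) = @_;
    my @out;
    while (defined(my $line = <$fh>)) {
        # Keep appending lines while the current text ends in "letter-".
        while ($hyph && $line =~ m/[A-Za-z\300-\377]-\s*$/) {
            my $next = <$fh>;
            last unless defined $next;   # end of input: leave the loop
            $line .= $next;
            $line =~ s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
        }
        push @out, $line;
    }
    return @out;
}

# Sample input whose last character (before the newline) is a hyphen,
# read through an in-memory filehandle.
my $text = "The docu-\nment ends with a hy-\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my @lines = join_hyphenated($fh, 1);
close($fh);
print @lines;
```

With the unpatched loop shape the same input would never finish, because
the trailing "hy-" keeps matching and nothing more can be read.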
> I don't think that the document size is a reason for this effect,
> because some of the files that caused the trouble (last line in
> htdig.log) had a size of only 10 to 40 KByte. Some bigger files (up to
> 34 MByte) didn't stop doc2html.
>
> By the way: Where do I have to set $Verbose?
    sub init {
        # set = 1 for O/P on stderr if successful
        $Verbose = 1;
> Is it possible to write the
> messages of pdftotext and doc2html in a separate logfile?
>
Perhaps in the next version of doc2html.
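
Until then, one possible workaround is to reopen STDERR on a file early
in the script. This is only a sketch: redirect_stderr_to and the path
/tmp/doc2html.log are assumptions for illustration, not existing
doc2html settings, and I have not tried it inside doc2html.pl itself.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of doc2html.pl: send STDERR to a logfile.
sub redirect_stderr_to {
    my ($path) = @_;
    # Reopen STDERR in append mode; child processes started afterwards
    # inherit the redirected handle.
    open(STDERR, '>>', $path) or die "cannot append to $path: $!";
    # Turn off buffering on STDERR so messages are written immediately.
    my $old = select(STDERR);
    $| = 1;
    select($old);
    return 1;
}

# Assumed log location; choose any path the indexing user can write to.
redirect_stderr_to("/tmp/doc2html.log");
print STDERR "doc2html: example message\n";
```

Anything a converter such as pdftotext writes to stderr after that point
should end up in the same file.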
> Why doesn't htdig/doc2html take the complete document for parsing? You
> only have to take max_doc_size into account when you take the parsed
> documents for indexing. This might reduce the problems with doctypes
> other than html or plain text.
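max_doc_size affects all documents fetched by htdig. It is
a safety device to prevent the downloading of extremely
large (or infinitely long!) documents. Because the limit applies
to the download itself, a document is already cut off at that
size before doc2html gets to see it.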
>
> Thanks in advance
>
> Berthold Cogel
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> [EMAIL PROTECTED]
> You will receive a message to confirm this.
> List archives: <http://www.htdig.org/mail/menu.html>
> FAQ: <http://www.htdig.org/FAQ.html>
>
----------------------
David Adams
[EMAIL PROTECTED]