As a follow-up to the recent thread between Jon, David, and Steve, I just
wanted to let you all know that I discovered a bug in the external_parsers
handling of htdig (versions 3.1.6 and 3.2.0b6).
Jon Sorensen reported verbose htdig output like this:
> Content-Type: application/pdf
> Header line:
> returnStatus = 0
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 907 from document
> Read a total of 361355 bytes
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
I'd seen output like that before in posts to htdig-general, but couldn't
make sense of it.
Jon also asked:
> I posted a question recently about indexing PDFs with doc2html,
> but I can't figure out what the problem is. I believe that the config
> is correct, but there may be a problem there. When I dig a number of
> PDFs, the files are read but the words indexed are not correct:
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> Does anyone know what this indicates? From looking at the message
> archives it seems that others have had this problem, but there weren't
> any solutions posted in the messages.
It appeared that htdig's stdout was being fed back into the parser,
which seemed to defy all logic, until I figured out the cause on a new
test system that was also having problems indexing PDFs. When I ran the
external converter manually, I got the error:
/usr/local/bin/perl: bad interpreter: No such file or directory
The problem was that the script began with "#!/usr/local/bin/perl",
which worked fine on the older system, but not on the newer one.
That explained why PDF indexing didn't work (htdig couldn't "exec"
the external_parsers script), but not why htdig was eating its own output.
Then I realized what was going on: htdig does a fork() and execv()
to call the script, and if the execv() fails the child process exits,
as it should. But the child exits using the exit() function rather
than _exit(), which is a no-no in a forked child. The problem is that
fork() makes a duplicate of everything in the parent process,
including all the parent's stdio buffers. If the child then calls
exit(), it flushes its copy of the parent's stdout buffer, so a copy
of much of the parent's verbose output gets flushed out into the
child's pipe, which the parent reads and parses as if it were parser
output. The fix is to change
htdig/ExternalParser.cc like this:
--- htdig/ExternalParser.cc.orig 2004-05-28 08:15:14.000000000 -0500
+++ htdig/ExternalParser.cc 2004-12-16 16:37:14.000000000 -0600
@@ -280,7 +280,11 @@ ExternalParser::parse(Retriever &retriev
 // Call External Parser
 execv(parsargs[0], parsargs);
-exit(EXIT_FAILURE);
+perror("execv");
+write(STDERR_FILENO, "External parser error: Can't execute ", 37);
+write(STDERR_FILENO, parsargs[0], strlen(parsargs[0]));
+write(STDERR_FILENO, "\n", 1);
+_exit(EXIT_FAILURE);
 }
 // Parent Process
Of course, this is only a problem if the external parser/converter script
can't be exec'ed by htdig, so if all is working well, this bug won't be
an issue.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general