As a follow-up to the recent thread between Jon, David, and Steve, I just
wanted to let you all know that I discovered a bug in the external_parsers
handling of htdig (versions 3.1.6 and 3.2.0b6).
Jon Sorensen reported verbose htdig output like this:
> Content-Type: application/pdf
> Header line:
> returnStatus = 0
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 907 from document
> Read a total of 361355 bytes
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
I'd seen output like that before in posts to htdig-general, but couldn't
make sense of it.
Jon also asked:
> I posted a question recently about indexing PDFs with doc2html,
> but I can't figure out what the problem is. I believe that the config
> is correct, but there may be a problem there. When I dig a number of
> PDFs, the files are read but the words indexed are not correct:
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> word: [EMAIL PROTECTED]
> Does anyone know what this indicates? From looking at the message
> archives it seems that others have had this problem, but there weren't
> any solutions posted in the messages.
It appeared that htdig's stdout was being fed back into the parser,
which seemed to defy all logic, until I figured out the cause on a new
test system that was also having problems indexing PDFs. When I ran the
external converter manually, I got the error:
/usr/local/bin/perl: bad interpreter: No such file or directory
The problem was that the script began with "#!/usr/local/bin/perl",
which worked fine on the older system, but not on the newer one.
That explained why PDF indexing didn't work (htdig couldn't "exec"
the external_parsers script), but not why htdig was eating its own output.
Then I realized what was going on: htdig does a fork() and execv()
to call the script, and if the execv() fails the child process exits,
as it should. But the child exits using the exit() function rather
than _exit(), which is a no-no in a forked child. The problem is that
fork() makes a duplicate of everything in the parent process,
including all the parent's stdio buffers. If the child then calls
exit(), it flushes its copy of the parent's stdout buffer, so a copy
of much of the parent's verbose output gets flushed out into the
child's pipe, which the parent reads and parses as if it were parser
output. The fix is to change
htdig/ExternalParser.cc like this:
--- htdig/ExternalParser.cc.orig 2004-05-28 08:15:14.000000000 -0500
+++ htdig/ExternalParser.cc 2004-12-16 16:37:14.000000000 -0600
@@ -280,7 +280,11 @@ ExternalParser::parse(Retriever &retriev
 // Call External Parser
 execv(parsargs[0], parsargs);
-exit(EXIT_FAILURE);
+perror("execv");
+write(STDERR_FILENO, "External parser error: Can't execute ", 37);
+write(STDERR_FILENO, parsargs[0], strlen(parsargs[0]));
+write(STDERR_FILENO, "\n", 1);
+_exit(EXIT_FAILURE);
 }
 // Parent Process
Of course, this is only a problem if the external parser/converter script
can't be exec'ed by htdig, so if all is working well, this bug won't be
an issue.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general