Gilles,
Congratulations on getting to the bottom of this; it explains a few mysterious reports of difficulties with external parsing sent to the mailing list over the last couple of years.
Could I clarify one point with you? You wrote:
The problem was that the script began with "#!/usr/local/bin/perl", which worked fine on the older system, but not on the newer one.
Was this simply because the Perl binary was in a different location on your newer system, or because there is a general problem on the newer system with this method of specifying the executable to run the script (which would be serious indeed!)?
I think the moral is that users must take great care in correctly configuring their external parser(s), and must check that they work from the command line.
David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Gilles Detillieux" <[EMAIL PROTECTED]>
To: "ht://Dig mailing list" <[EMAIL PROTECTED]>
Cc: "Gilles Detillieux" <[EMAIL PROTECTED]>; "Gilbert Detillieux" <[EMAIL PROTECTED]>
Sent: Thursday, December 16, 2004 10:40 PM
Subject: [htdig] external_parsers bug (was Re: [htdig] pdf indexing problems)
As a followup to the recent thread between Jon, David and Steve, I just wanted to let you all know that I discovered a bug in the external_parsers handling of htdig (versions 3.1.6 and 3.2.0b6).
Jon Sorensen reported verbose htdig output like this:

Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 907 from document
Read a total of 361355 bytes
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]

I've seen that before in posts to htdig-general, but couldn't make sense of that.
Jon also asked:

I posted a question recently about indexing pdfs with doc2html
but I can't figure out what the problem is. I believe that the config is correct
but there may be a problem there. When I dig a number of pdfs the files
are read but the words indexed are not correct:
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Does anyone know what this indicates?
From looking at the message archives it seems that others have had this problem
but there weren't any solutions posted in the messages
It appears that htdig's stdout is being fed back into the parser, which seemed to defy all logic, until I figured out the cause on a new test system, which was also having problems indexing PDFs. When I ran the external converter manually, I got the error:
/usr/local/bin/perl: bad interpreter: No such file or directory
The problem was that the script began with "#!/usr/local/bin/perl", which worked fine on the older system, but not on the newer one. That explained why PDF indexing didn't work (htdig couldn't "exec" the external_parsers script), but not why htdig was eating its own output.
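The "bad interpreter" failure described above can be caught before running htdig. The following is only a minimal sketch, not part of htdig: the function names interpreter_from_shebang() and script_interpreter_is_executable() are hypothetical, and it assumes a POSIX system where access() is available. It reads a script's first line, extracts the interpreter path after "#!", and checks that the path is executable.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <unistd.h>

// Hypothetical helper: returns the interpreter path named on a "#!" line,
// or an empty string if the line is not a shebang line.
std::string interpreter_from_shebang(const std::string& first_line)
{
    if (first_line.compare(0, 2, "#!") != 0)
        return "";
    // Skip optional blanks between "#!" and the path.
    std::size_t start = first_line.find_first_not_of(" \t", 2);
    if (start == std::string::npos)
        return "";
    // The path ends at the first blank (interpreter arguments may follow).
    std::size_t end = first_line.find_first_of(" \t\r\n", start);
    if (end == std::string::npos)
        return first_line.substr(start);
    return first_line.substr(start, end - start);
}

// Hypothetical helper: true if the script can be read, starts with a
// shebang line, and the interpreter it names is executable by this user.
bool script_interpreter_is_executable(const std::string& script_path)
{
    std::ifstream in(script_path);
    std::string first_line;
    if (!in || !std::getline(in, first_line))
        return false;
    std::string interp = interpreter_from_shebang(first_line);
    return !interp.empty() && access(interp.c_str(), X_OK) == 0;
}
```

Calling script_interpreter_is_executable() on the path of a converter script should return false in the situation described here, where /usr/local/bin/perl does not exist on the newer system. It does not replace running the script by hand, since the script can still fail for other reasons.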
Then I realized what was going on: htdig does a fork() and execv() to call the script, and if the execv() fails, the child process exits, as it should. But the child process exits using the exit() function rather than _exit(), which is a no-no in a child process. The problem is that fork() makes a duplicate of everything in the parent process, including all the parent's I/O buffers. When the child process calls exit(), it flushes its copy of the parent's stdout buffer. By then the child's stdout is the pipe back to the parent, so a copy of much of the parent's verbose output gets flushed into that pipe, which the parent then reads and parses. The fix is to change htdig/ExternalParser.cc like this:
--- htdig/ExternalParser.cc.orig 2004-05-28 08:15:14.000000000 -0500
+++ htdig/ExternalParser.cc 2004-12-16 16:37:14.000000000 -0600
@@ -280,7 +280,11 @@ ExternalParser::parse(Retriever &retriev
         // Call External Parser
         execv(parsargs[0], parsargs);
 
-        exit(EXIT_FAILURE);
+        perror("execv");
+        write(STDERR_FILENO, "External parser error: Can't execute ", 37);
+        write(STDERR_FILENO, parsargs[0], strlen(parsargs[0]));
+        write(STDERR_FILENO, "\n", 1);
+        _exit(EXIT_FAILURE);
     }
 
     // Parent Process

Of course, this is only a problem if the external parser/converter script can't be exec'ed by htdig, so if all is working well, this bug won't be an issue.
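The exit() versus _exit() behaviour described above can be reproduced outside htdig. The following is a minimal sketch, not htdig code: the function name child_output_after_failed_exec() and the message text are made up for illustration, the execv() failure is simulated rather than attempted, and it assumes stdout is buffered (the default for a terminal, file or pipe) so that text without a newline is still sitting in the stdio buffer when fork() is called.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: forks a child whose stdout is a pipe back to the
// parent, lets the child quit as if execv() had failed, and returns
// whatever the parent reads from the pipe.
std::string child_output_after_failed_exec(bool use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) == -1)
        return "";

    // Stands in for the parent's verbose output: buffered, not yet written.
    std::fputs("parent verbose output;", stdout);

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return "";
    }
    if (pid == 0) {
        // Child: make the pipe its stdout, as a parser child would.
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        // execv() would be called here; pretend it failed and returned.
        if (use_underscore_exit)
            _exit(EXIT_FAILURE);  // leaves the copied stdio buffer alone
        exit(EXIT_FAILURE);       // flushes the copied buffer into the pipe
    }

    // Parent: read everything the child wrote to the pipe.
    close(fds[1]);
    std::string got;
    char buf[256];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        got.append(buf, static_cast<std::size_t>(n));
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    std::fflush(stdout);  // the parent's own copy goes to its real stdout
    return got;
}
```

With use_underscore_exit set to false, the returned string should contain the parent's buffered text, which mirrors htdig parsing its own verbose output. With it set to true, the parent should read nothing from the pipe.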
--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://productguide.itmanagersjournal.com/
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general

