Gilles,

Congratulations on getting to the bottom of this; it explains several puzzling reports of external parsing difficulties posted to the mailing list over the last couple of years.

Could I clarify one point with you?  You wrote:

The problem was that the script began with "#!/usr/local/bin/perl",
which worked fine on the older system, but not on the newer one.

Was this simply because the Perl binary is in a different location on your newer system, or is there a general problem on the newer system with this method of specifying the executable to run the script (which would be serious indeed!)?


I think the moral is that users must take great care to configure their external parser(s) correctly, and must check that they work from the command line.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message ----- From: "Gilles Detillieux" <[EMAIL PROTECTED]>
To: "ht://Dig mailing list" <[EMAIL PROTECTED]>
Cc: "Gilles Detillieux" <[EMAIL PROTECTED]>; "Gilbert Detillieux" <[EMAIL PROTECTED]>
Sent: Thursday, December 16, 2004 10:40 PM
Subject: [htdig] external_parsers bug (was Re: [htdig] pdf indexing problems)



As a followup to the recent thread between Jon, David and Steve, I just
wanted to let you all know that I discovered a bug in the external_parsers
handling of htdig (versions 3.1.6 and 3.2.0b6).

Jon Sorensen reported verbose htdig output like this:
    Content-Type: application/pdf
    Header line:
    returnStatus = 0
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 907 from document
    Read a total of 361355 bytes
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]
    word: [EMAIL PROTECTED]

I've seen that before in posts to htdig-general, but couldn't make sense of it.

Jon also asked:
I posted a question recently about indexing pdfs with doc2html
but I can't figure out what the problem is. I believe that the config is correct
but there may be a problem there. When I dig a number of pdfs the files
are read but the words indexed are not correct:
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
Does anyone know what this indicates?
From looking at the message archives it seems that others have had this problem
but there weren't any solutions posted in the messages

It appears that htdig's stdout is being fed back into the parser, which seemed to defy all logic, until I figured out the cause on a new test system, which was also having problems indexing PDFs. When I ran the external converter manually, I got the error:

/usr/local/bin/perl: bad interpreter: No such file or directory

The problem was that the script began with "#!/usr/local/bin/perl",
which worked fine on the older system, but not on the newer one.
That explained why PDF indexing didn't work (htdig couldn't "exec"
the external_parsers script), but not why htdig was eating its own output.
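
A check for this kind of failure can be sketched in C. This is only an illustration, not part of htdig: the function name shebang_interpreter_ok is hypothetical, and it assumes the interpreter path is the first whitespace-delimited token after "#!" on the script's first line.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an htdig function).
 * Returns  1 if the script's "#!" interpreter exists and is executable,
 *          0 if the interpreter is missing or not executable,
 *         -1 if the script can't be read or has no "#!" line.
 */
int shebang_interpreter_ok(const char *script_path)
{
    char line[512];
    char *interp;
    FILE *fp = fopen(script_path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    if (strncmp(line, "#!", 2) != 0)
        return -1;

    /* Skip blanks between "#!" and the interpreter path. */
    interp = line + 2;
    interp += strspn(interp, " \t");

    /* Cut at the first blank or newline; anything after it is an argument. */
    interp[strcspn(interp, " \t\r\n")] = '\0';
    if (*interp == '\0')
        return -1;

    return access(interp, X_OK) == 0 ? 1 : 0;
}
```

For a script starting with "#!/usr/local/bin/perl" on a system with no Perl at that path, this sketch would return 0, which corresponds to the "bad interpreter" error above.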

Then I realized what was going on:  htdig does a fork() and execv()
to call the script, and if the execv() fails the child process exits,
as it should.  But, the child process exits using the exit() function,
rather than _exit(), which is a no-no in a child process.  The problem
is that the fork() makes a duplicate of everything in the parent
process, including all the parent's I/O buffers.  If the child process
calls exit(), it flushes its copy of the parent's stdout buffer, so a
copy of much of the parent's verbose output gets flushed out into the
child's pipe, which the parent reads and parses.  The fix is to change
htdig/ExternalParser.cc like this:

--- htdig/ExternalParser.cc.orig 2004-05-28 08:15:14.000000000 -0500
+++ htdig/ExternalParser.cc 2004-12-16 16:37:14.000000000 -0600
@@ -280,7 +280,11 @@ ExternalParser::parse(Retriever &retriev
 // Call External Parser
 execv(parsargs[0], parsargs);

- exit(EXIT_FAILURE);
+ perror("execv");
+ write(STDERR_FILENO, "External parser error: Can't execute ", 37);
+ write(STDERR_FILENO, parsargs[0], strlen(parsargs[0]));
+ write(STDERR_FILENO, "\n", 1);
+ _exit(EXIT_FAILURE);
    }

    // Parent Process

Of course, this is only a problem if the external parser/converter script
can't be exec'ed by htdig, so if all is working well, this bug won't be
an issue.
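
The buffering effect described above can be reproduced in isolation. The following is a minimal sketch in C, not code from htdig: the function name bytes_echoed_by_child and its use_plain_exit flag are hypothetical, and the failed execv() is only simulated. It leaves unflushed text in the parent's stdout buffer, forks a child whose stdout is a pipe, and counts what the parent reads back.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an htdig function).
 * Returns the number of bytes the parent reads from the child's pipe,
 * or -1 if pipe() or fork() fails.
 */
long bytes_echoed_by_child(int use_plain_exit)
{
    int fds[2];
    pid_t pid;
    long total = 0;
    char buf[256];
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;

    /* No newline, so this stays in the stdio buffer whether stdout is
     * line-buffered (terminal) or fully buffered (file or pipe). */
    fputs("word: pending-verbose-output ", stdout);

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: point stdout at the pipe, as is done before exec'ing
         * an external parser. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);

        /* Assume execv() was attempted here and failed. */
        if (use_plain_exit)
            exit(EXIT_FAILURE);  /* flushes the inherited buffer into the pipe */
        _exit(EXIT_FAILURE);     /* terminates without flushing stdio buffers */
    }

    /* Parent: read whatever the child left in the pipe. */
    close(fds[1]);
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);

    /* The parent's own copy of the text is still buffered; send it to the
     * real stdout so the next call starts with an empty buffer. */
    fputc('\n', stdout);
    fflush(stdout);
    return total;
}
```

Calling bytes_echoed_by_child(1) should return a positive count, because exit() flushes the child's copy of the parent's buffer into the pipe, while bytes_echoed_by_child(0) should return 0, because _exit() skips the stdio flush.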

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now. http://productguide.itmanagersjournal.com/
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general



