According to =?iso-8859-1?Q?Josep_Llaurad=F3_Selvas?=:
> I get the next lines as output of htdig (with -vvv):
...
> Header line: HTTP/1.1 200 OK
> Header line: Date: Fri, 26 Jan 2001 09:43:09 GMT
> Header line: Server: Apache/1.3.9 (Unix) Debian/GNU PHP/4.0.3pl1
> Header line: Last-Modified: Mon, 30 Oct 2000 15:29:29 GMT
> Translated Mon, 30 Oct 2000 15:29:29 GMT to 2000-10-30 15:29:29 (100)
> And converted to Mon, 30 Oct 2000 15:29:29
> Header line: ETag: "44222-5400-39fd93d9"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 21504
> Header line: Connection: close
> Header line: Content-Type: application/msword; charset=iso-8859-1
> not HTML
>
> ------------------------------------------------------------------------------
>
> As I can understand, the htdig don't get Word documents, 'cos closes the
> connection to the server saying that the document isn't an HTML document.
>
> In my config file I have the following lines:
>
> external_parsers: application/msword->text/html /usr/share/htdig/parse_doc.pl \
> application/postscript->text/html /usr/share/htdig/parse_doc.pl \
> application/pdf->text/html /usr/share/htdig/parse_doc.pl
>
> but the parser seems to miss the external parsers or something like
> that. I have tried the parse_doc and the conv_doc perl scripts and don't
> run, also the syntax:
>
> 'application/msword->text/html' and the 'application/msword'
>
> without any satisfactory result. What I'm doing wrong?
>
> I'm using Debian GNU/Linux 2.2 Potato with htdig-3.1.5-2
First of all, get rid of parse_doc.pl, and use conv_doc.pl or doc2html.pl.
That won't solve this particular problem, but it will avoid a whole lot
of other problems down the road.
The syntax 'application/msword->text/html' is for external converters,
like conv_doc.pl and doc2html.pl, which output HTML code, rather than
an external parser which outputs its own unique codes for parsed text.
See http://htdig.sourceforge.net/attrs.html#external_parsers for details.
The problem with htdig not recognising the external parser definitions
is a known bug, which seems to show up on recent Debian configurations.
There are two fixes for it. One is to configure Apache not to add the
"charset=" string to the Content-Type header, which is what confuses
htdig.
Here is the suggestion from Elijah Kagan <[EMAIL PROTECTED]>:
There are two parameters in Apache config file that tell it to add a
charset field by default. They are: AddDefaultCharset and
AddDefaultCharsetName. The first one should be set to off to prevent
Apache from replying with a charset field set after the content type.
After disabling AddDefaultCharset htdig worked as expected.
The other fix is to apply a patch to htdig/ExternalParser.cc to make
it correctly handle these headers. Applying this patch is a bit more
work than the quick fix above, but it gets to the root of the problem.
This patch also fixes a few other bugs in the external parser support.
It's backported from the 3.2.0b3 development code. The patch archive
seems to be down right now, so I'll repost the patch here...
According to Elijah Kagan:
> I run htdig 3.1.5.
> I tried both the Debian package and a compiled one with the same result.
> I am absolutely sure there is something stupid I forgot to put into the
> configuration.
OK, after getting to the bottom of this (I think!), I have backported
the 3.2.0b3 development code for htdig/ExternalParser.cc to version
3.1.5, to fix this and other problems. Please give this patch file
a try and let me know if it works. You will probably get a warning
about the wait() function being implicitly declared, unless you manually
define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system
has <wait.h> or <sys/wait.h>). Also, if your system has the mkstemp()
function, you may want to define HAVE_MKSTEMP manually as well, as this
will enhance security. I didn't have time to figure out how to patch
aclocal.m4 and configure to add tests for all of these.
The patch fixes the following problems in external_parsers support in
3.1.5:
- it got confused by "; charset=..." in the Content-Type header,
as described in "http://www.htdig.org/mail/2000/09/index.html#75".
- security problems with using popen(), and therefore the shell,
to parse URL and content-type strings from untrusted sources
(now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
- used predictable temporary file name, which could be exploited
via symlinks - fixed if mkstemp() exists & HAVE_MKSTEMP is defined.
- binary output from an external converter could get mangled.
- error messages were sometimes ambiguous or missing altogether.
- didn't open temporary file in binary mode for non-Unix systems
(attempts were made to fix this, but it's not clear yet whether
the security fixes and pipe/fork/exec will port well to Cygwin).
Here's the patch, which you can apply in the main source directory for
htdig-3.1.5 using "patch -p0 < this-file":
--- htdig/ExternalParser.cc.orig Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc Mon Jan 15 17:16:50 2001
@@ -1,14 +1,24 @@
//
// ExternalParser.cc
//
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+// Allows external programs to parse unknown document formats.
+// The parser is expected to return the document in a
+// specific format. The format is documented
+// in http://www.htdig.org/attrs.html#external_parser
//
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil
Exp $";
-#endif
+// Part of the ht://Dig package <http://www.htdig.org/>
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// <http://www.gnu.org/copyleft/gpl.html>
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 17:16:50 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
#include "ExternalParser.h"
#include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
#include "QuotedStringList.h"
#include "URL.h"
#include "Dictionary.h"
+#include "good_strtok.h"
+
#include <ctype.h>
#include <stdio.h>
-#include "good_strtok.h"
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
static Dictionary *parsers = 0;
static Dictionary *toTypes = 0;
@@ -32,9 +51,18 @@ extern String configFile;
//
ExternalParser::ExternalParser(char *contentType)
{
+ String mime;
+ int sep;
+
if (canParse(contentType))
{
- currentParser = ((String *)parsers->Find(contentType))->get();
+ String mime = contentType;
+ mime.lowercase();
+ sep = mime.indexOf(';');
+ if (sep != -1)
+ mime = mime.sub(0, sep).get();
+
+ currentParser = ((String *)parsers->Find(mime))->get();
}
ExternalParser::contentType = contentType;
}
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
int
ExternalParser::canParse(char *contentType)
{
+ int sep;
+
if (!parsers)
{
parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
QuotedStringList qsl(config["external_parsers"], " \t");
String from, to;
int i;
- int sep;
for (i = 0; qsl[i]; i += 2)
{
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
to = from.sub(sep+2).get();
from = from.sub(0, sep).get();
}
+ from.lowercase();
+ sep = from.indexOf(';');
+ if (sep != -1)
+ from = from.sub(0, sep).get();
+
parsers->Add(from, new String(qsl[i + 1]));
toTypes->Add(from, new String(to));
}
}
- return parsers->Exists(contentType);
+
+ String mime = contentType;
+ mime.lowercase();
+ sep = mime.indexOf(';');
+ if (sep != -1)
+ mime = mime.sub(0, sep).get();
+ return parsers->Exists(mime);
}
//*****************************************************************************
@@ -132,52 +172,128 @@ ExternalParser::parse(Retriever &retriev
// Write the contents to a temporary file.
//
String path = getenv("TMPDIR");
+ int fd;
if (path.length() == 0)
path = "/tmp";
- path << "/htdext." << getpid();
+#ifndef HAVE_MKSTEMP
+ path << "/htdext." << getpid(); // This is unfortunately predictable
- FILE *fl = fopen(path, "w");
- if (!fl)
+#ifdef O_BINARY
+ fd = open((char*)path, O_WRONLY|O_CREAT|O_EXCL|O_BINARY);
+#else
+ fd = open((char*)path, O_WRONLY|O_CREAT|O_EXCL);
+#endif
+#else
+ path << "/htdex.XXXXXX";
+ fd = mkstemp((char*)path);
+ // can we force binary mode somehow under Cygwin, if it has mkstemp?
+#endif
+ if (fd < 0)
{
- return;
+ if (debug)
+ cout << "External parser error: Can't create temp file "
+ << (char *)path << endl;
+ return;
}
- fwrite(contents->get(), 1, contents->length(), fl);
- fclose(fl);
-
- //
- // Now start the external parser.
- //
- String command = currentParser;
- command << ' ' << path << ' ' << contentType << " \"" << base.get() <<
- "\" " << configFile;
-
- FILE *input = popen(command, "r");
- if (!input)
- {
- unlink(path);
- return;
- }
-
- unsigned int minimum_word_length
- = config.Value("minimum_word_length", 3);
+ write(fd, contents->get(), contents->length());
+ close(fd);
+ unsigned int minimum_word_length = config.Value("minimum_word_length", 3);
String line;
char *token1, *token2, *token3;
- int loc, hd;
+ int loc = 0, hd = 0;
URL url;
- String convertToType = ((String *)toTypes->Find(contentType))->get();
+ String mime = contentType;
+ mime.lowercase();
+ int sep = mime.indexOf(';');
+ if (sep != -1)
+ mime = mime.sub(0, sep).get();
+ String convertToType = ((String *)toTypes->Find(mime))->get();
int get_hdr = (mystrcasecmp(convertToType, "user-defined") == 0);
int get_file = (convertToType.length() != 0);
String newcontent;
- while (readLine(input, line))
+
+ StringList cpargs(currentParser);
+ char **parsargs = new char * [cpargs.Count() + 5];
+ int argi;
+ for (argi = 0; argi < cpargs.Count(); argi++)
+ parsargs[argi] = (char *)cpargs[argi];
+ parsargs[argi++] = path.get();
+ parsargs[argi++] = contentType.get();
+ parsargs[argi++] = (char *)base.get();
+ parsargs[argi++] = configFile.get();
+ parsargs[argi++] = 0;
+
+ int stdout_pipe[2];
+ int fork_result = -1;
+ int fork_try;
+
+ if (pipe(stdout_pipe) == -1)
+ {
+ if (debug)
+ cout << "External parser error: Can't create pipe!" << endl;
+ unlink((char*)path);
+ delete [] parsargs;
+ return;
+ }
+
+ for (fork_try = 4; --fork_try >= 0;)
+ {
+ fork_result = fork(); // Fork so we can execute in the child process
+ if (fork_result != -1)
+ break;
+ if (fork_try)
+ sleep(3);
+ }
+ if (fork_result == -1)
+ {
+ if (debug)
+ cout << "Fork Failure in ExternalParser" << endl;
+ unlink((char*)path);
+ delete [] parsargs;
+ return;
+ }
+
+ if (fork_result == 0) // Child process
+ {
+ close(STDOUT_FILENO); // Close handle STDOUT to replace with pipe
+ dup(stdout_pipe[1]);
+ close(stdout_pipe[0]);
+ close(stdout_pipe[1]);
+ close(STDIN_FILENO); // Close STDIN to replace with file
+ open((char*)path, O_RDONLY);
+
+ // Call External Parser
+ execv(parsargs[0], parsargs);
+
+ exit(EXIT_FAILURE);
+ }
+
+ // Parent Process
+ delete [] parsargs;
+ close(stdout_pipe[1]); // Close STDOUT for writing
+#ifdef O_BINARY
+ FILE *input = fdopen(stdout_pipe[0], "rb");
+#else
+ FILE *input = fdopen(stdout_pipe[0], "r");
+#endif
+ if (input == NULL)
+ {
+ if (debug)
+ cout << "Fdopen Failure in ExternalParser" << endl;
+ unlink((char*)path);
+ return;
+ }
+
+ while ((!get_file || get_hdr) && readLine(input, line))
{
if (get_hdr)
{
line.chop('\r');
if (line.length() == 0)
get_hdr = FALSE;
- else if (mystrncasecmp(line, "content-type:", 13) == 0)
+ else if (mystrncasecmp((char*)line, "content-type:", 13) == 0)
{
token1 = line.get() + 13;
while (*token1 && isspace(*token1))
@@ -187,24 +303,9 @@ ExternalParser::parse(Retriever &retriev
}
continue;
}
- if (get_file)
- {
- if (newcontent.length() == 0 &&
- !canParse(convertToType) &&
- mystrncasecmp(convertToType, "text/", 5) != 0 &&
- mystrncasecmp(convertToType, "application/pdf", 15) != 0)
- {
- if (mystrcasecmp(convertToType, "user-defined") == 0)
- cerr << "External parser error: no Content-Type given\n";
- else
- cerr << "External parser error: can't parse Content-Type \""
- << convertToType << "\"\n";
- cerr << " URL: " << base.get() << "\n";
- break;
- }
- newcontent << line << '\n';
- continue;
- }
+#ifdef O_BINARY
+ line.chop('\r');
+#endif
token1 = strtok(line, "\t");
if (token1 == NULL)
token1 = "";
@@ -223,7 +324,7 @@ ExternalParser::parse(Retriever &retriev
(hd = atoi(token3)) >= 0 && hd < 12)
retriever.got_word(token1, loc, hd);
else
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: expected word in line "<<line<<"\n"
+<< " URL: " << base.get() << "\n";
break;
case 'u': // href
@@ -236,7 +337,7 @@ ExternalParser::parse(Retriever &retriev
retriever.got_href(url, token2);
}
else
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: expected URL in line "<<line<<"\n" <<
+" URL: " << base.get() << "\n";
break;
case 't': // title
@@ -244,7 +345,7 @@ ExternalParser::parse(Retriever &retriev
if (token1 != NULL)
retriever.got_title(token1);
else
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: expected title in line "<<line<<"\n"
+<< " URL: " << base.get() << "\n";
break;
case 'h': // head
@@ -252,7 +353,7 @@ ExternalParser::parse(Retriever &retriev
if (token1 != NULL)
retriever.got_head(token1);
else
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: expected text in line "<<line<<"\n"
+<< " URL: " << base.get() << "\n";
break;
case 'a': // anchor
@@ -260,7 +361,7 @@ ExternalParser::parse(Retriever &retriev
if (token1 != NULL)
retriever.got_anchor(token1);
else
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: expected anchor in line "<<line<<"\n"
+<< " URL: " << base.get() << "\n";
break;
case 'i': // image url
@@ -268,7 +369,7 @@ ExternalParser::parse(Retriever &retriev
if (token1 != NULL)
retriever.got_image(token1);
else
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: expected image URL in line
+"<<line<<"\n" << " URL: " << base.get() << "\n";
break;
case 'm': // meta
@@ -310,7 +411,7 @@ ExternalParser::parse(Retriever &retriev
if (mystrcasecmp(httpEquiv, "refresh") == 0
&& *content != '\0')
{
- char *q = mystrcasestr(content, "url=");
+ char *q = (char*)mystrcasestr(content, "url=");
if (q && *q)
{
q += 4; // skiping "URL="
@@ -365,7 +466,7 @@ ExternalParser::parse(Retriever &retriev
meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
if (debug > 1)
cout << "META Description: " << content << endl;
- retriever.got_meta_dsc(meta_dsc);
+ retriever.got_meta_dsc((char*)meta_dsc);
//
// Now add the words to the word list
@@ -382,17 +483,42 @@ ExternalParser::parse(Retriever &retriev
}
}
else
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: expected metadata in line
+"<<line<<"\n" << " URL: " << base.get() << "\n";
break;
}
default:
- cerr<< "External parser error in line:"<<line<<"\n" << " URL: " <<
base.get() << "\n";
+ cerr<< "External parser error: unknown field in line "<<line<<"\n"
+<< " URL: " << base.get() << "\n";
break;
}
+ } // while(readLine)
+ if (get_file)
+ {
+ if (!canParse(convertToType) &&
+ mystrncasecmp((char*)convertToType, "text/", 5) != 0 &&
+ mystrncasecmp((char*)convertToType, "application/pdf", 15) != 0)
+ {
+ if (mystrcasecmp((char*)convertToType, "user-defined") == 0)
+ cerr << "External parser error: no Content-Type given\n";
+ else
+ cerr << "External parser error: can't parse Content-Type \""
+ << convertToType << "\"\n";
+ cerr << " URL: " << base.get() << "\n";
+ }
+ else
+ {
+ char buffer[2048];
+ int length;
+ while ((length = fread(buffer, 1, sizeof(buffer), input)) > 0)
+ newcontent.append(buffer, length);
+ }
}
- pclose(input);
- unlink(path);
+ fclose(input);
+ // close(stdout_pipe[0]); // This is closed for us by the fclose()
+ int rpid, status;
+ while ((rpid = wait(&status)) != fork_result && rpid != -1)
+ ;
+ unlink((char*)path);
if (newcontent.length() > 0)
{
@@ -407,19 +533,19 @@ ExternalParser::parse(Retriever &retriev
currentParser = ((String *)parsers->Find(contentType))->get();
parsable = this;
}
- else if (mystrncasecmp(contentType, "text/html", 9) == 0)
+ else if (mystrncasecmp((char*)contentType, "text/html", 9) == 0)
{
if (!html)
html = new HTML();
parsable = html;
}
- else if (mystrncasecmp(contentType, "text/plain", 10) == 0)
+ else if (mystrncasecmp((char*)contentType, "text/plain", 10) == 0)
{
if (!plaintext)
plaintext = new Plaintext();
parsable = plaintext;
}
- else if (mystrncasecmp(contentType, "application/pdf", 15) == 0)
+ else if (mystrncasecmp((char*)contentType, "application/pdf", 15) == 0)
{
if (!pdf)
pdf = new PDF();
@@ -432,7 +558,7 @@ ExternalParser::parse(Retriever &retriev
parsable = plaintext;
if (debug)
cout << "External parser error: \"" << contentType <<
- "\" not a recognized type. Assuming text\n";
+ "\" not a recognized type. Assuming text/plain\n";
}
parsable->setContents(newcontent.get(), newcontent.length());
parsable->parse(retriever, base);
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
_______________________________________________
htdig-general mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-general