This patch fixes a problem in the 3.1.x text/html and text/plain parsers:
they stop parsing as soon as they encounter an ASCII NUL (0) character.
While this problem has been around since the very early days of htdig,
it was only brought to light after the release of 3.1.6. I guess that
means there aren't many text documents out there that contain nulls,
thankfully. However, if this is a problem for you, and fixing the
documents isn't an easy option, you may want to apply this patch.
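
To illustrate the symptom, here is a minimal sketch, assuming the parser
walks the document as a NUL-terminated C string (the actual loops in
HTML.cc and Plaintext.cc are not shown in this message). The names
scan_until_nul and bytes_seen are hypothetical, not htdig functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a parser loop that treats the buffer as a
// C string: it stops at the first NUL byte, wherever that falls.
std::size_t scan_until_nul(const char *buf)
{
    assert(buf != nullptr);
    std::size_t n = 0;
    while (buf[n] != '\0')
        ++n;
    return n;
}

// Hypothetical helper: how many bytes of a length-counted document
// such a loop would actually see.
std::size_t bytes_seen(const std::string &doc)
{
    return scan_until_nul(doc.c_str());
}
```

With a 7-byte document such as std::string("abc\0def", 7), a loop like
this sees only the first 3 bytes, so everything after the embedded NUL
would never be indexed.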

NOTE: This patch may not be for everyone! It will more than likely
slow down parsing of documents, particularly on slower systems with
little RAM. The reason is that the parser makes an extra pass through
the in-memory copy of the document to find and replace nulls, which
will cause extra paging if the whole document doesn't stay in htdig's
set of resident pages.
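
For a rough idea of what that extra pass costs, here is a sketch using
std::string in place of htdig's own String class. It assumes that
contents->replace('\0', ' ') behaves like a plain linear scan; the
function name replace_nuls is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical equivalent of the patch's replace call: one linear pass
// that reads every byte of the document and overwrites each NUL with a
// space. Returns the number of bytes replaced.
std::size_t replace_nuls(std::string &doc)
{
    std::size_t replaced = 0;
    for (char &c : doc) {
        if (c == '\0') {
            c = ' ';
            ++replaced;
        }
    }
    // After the pass, no NUL bytes remain inside the counted length.
    assert(doc.find('\0') == std::string::npos);
    return replaced;
}
```

The pass reads every byte whether or not the document contains any
nulls, which is why it can add paging on a large document that doesn't
fit in resident memory.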

Apply this patch in your main htdig-3.1.6 source directory using the
command:

    patch -p0 < this-message-file

--- htdig/HTML.cc.orig Thu Jan 31 17:47:17 2002
+++ htdig/HTML.cc Thu Feb 7 15:00:15 2002
@@ -146,6 +146,8 @@ HTML::parse(Retriever &retriever, URL &b
if (contents == 0 || contents->length() == 0)
return;
+ contents->replace('\0', ' ');
+
base = &baseURL;
//
--- htdig/Plaintext.cc.orig Thu Jan 31 17:47:17 2002
+++ htdig/Plaintext.cc Thu Feb 7 15:00:33 2002
@@ -40,6 +40,8 @@ Plaintext::parse(Retriever &retriever, U
if (contents == 0 || contents->length() == 0)
return;
+ contents->replace('\0', ' ');
+
unsigned char *position = (unsigned char *) contents->get();
unsigned char *start = position;
static int minimumWordLength = config.Value("minimum_word_length", 3);
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930