htdig: Patch for bug: non-word (non-punctuation) characters included from descriptions

Hans-Peter Nilsson Wed, 6 Jan 1999 00:28:55 -0500
There's a bug (gasp! but at least it's harmless) that causes
inclusion of non-word characters into the database.

Non-word characters are painstakingly removed from words in the
text proper (see HTML.cc), but kept in descriptions for obvious
reasons.  However, when the words from a description are
weighted and added to the list of words to index, these
characters go in for free (and I don't mean the
"valid_punctuation" characters).

Since the invalid characters are ignored in htsearch (they are
removed before the search, hence words that have them will not
match), they just take up room in the database, and no-one will
ever see them.
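
To illustrate why such entries can never match, here is a minimal
sketch.  It is not htsearch's actual code: strip_non_word() and
query_matches() are hypothetical helpers, and the sketch assumes the
search side simply drops every non-alphanumeric character and
lowercases the rest before looking the term up.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, NOT htsearch's real code: drop everything that is
// not alphanumeric from a search term and lowercase what remains.
std::string strip_non_word(const std::string &term)
{
    std::string out;
    for (char c : term) {
        // Cast first: passing a negative char to isalnum() is undefined.
        unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc))
            out += static_cast<char>(std::tolower(uc));
    }
    return out;
}

// True if the cleaned-up query term equals the word stored in the database.
bool query_matches(const std::string &query, const std::string &stored)
{
    return strip_non_word(query) == stored;
}
```

Under that assumption, a stored entry that still contains an
exclamation mark and a quote can never equal a cleaned-up query, so
it is dead weight in the database.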

Here's an example, indexing the single document:
<html>
<head><title>gross</title></head>
<body>
<a href="nosupper.html">"I don't want no supper!" he said.</a>
</body>
</html>

With valid_punctuation set to just the asterisk:
 valid_punctuation: *

And max_description_length (the default value, but it matters here):
 max_description_length: 60

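the wordlist comes out like this before merging (which, as you
may know, just sorts the words).  The entries with l:0 and
w:150000 come from the description; three of them still carry
non-word characters: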

 said.  l:0     i:1     w:150000
 supper!"       l:0     i:1     w:150000
 don't  l:0     i:1     w:150000
 want   l:0     i:1     w:150000
 don    l:587   i:0     w:413
 supper l:698   i:0     w:302
 gross  l:158   i:0     w:84200
 said   l:793   i:0     w:207
 want   l:634   i:0     w:366

(BTW, the words are counted twice; once for being in the text,
and once for being in a description.  I believe it's ok, it's
supposed to be like that.)
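
The four description entries above can be reproduced with a small
standalone sketch of the pre-patch logic.  It uses std::string in
place of htdig's String class, and old_description_words() is a
hypothetical name; only the delimiter set, the lowercasing, the
removal of valid_punctuation and the minimum length of 3 are taken
from the code being replaced.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the pre-patch loop: split on the strtok()
// delimiters, lowercase, strip valid_punctuation, and keep every token
// that is at least min_len characters long.  Nothing else is removed,
// so quotes, periods and exclamation marks survive.
std::vector<std::string> old_description_words(const std::string &desc,
                                               const std::string &valid_punctuation,
                                               std::size_t min_len = 3)
{
    const std::string delims = " ,\t\r\n";
    std::vector<std::string> out;
    std::size_t pos = 0;
    while ((pos = desc.find_first_not_of(delims, pos)) != std::string::npos) {
        std::size_t end = desc.find_first_of(delims, pos);
        if (end == std::string::npos)
            end = desc.size();
        std::string word = desc.substr(pos, end - pos);
        pos = end;
        if (word.size() < min_len)
            continue;
        for (char &c : word)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        word.erase(std::remove_if(word.begin(), word.end(),
                                  [&](char c) {
                                      return valid_punctuation.find(c) != std::string::npos;
                                  }),
                   word.end());
        if (word.size() >= min_len)
            out.push_back(word);
    }
    return out;
}
```

Called on the link text of the example document with "*" as
valid_punctuation, this returns the same four words as the l:0
entries in the list (in document order rather than wordlist order).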

Here's a patch to handle characters in description-words just
like for other words:

Wed Jan  6 05:53:02 1999  Hans-Peter Nilsson  <[EMAIL PROTECTED]>

        * htcommon/DocumentRef.cc (AddDescription): Do not add non-word
        characters to the wordlist.

Index: DocumentRef.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htcommon/DocumentRef.cc,v
retrieving revision 1.11
diff -p -c -r1.11 DocumentRef.cc
*** DocumentRef.cc      1999/01/05 19:35:42     1.11
--- DocumentRef.cc      1999/01/06 04:46:45
*************** void DocumentRef::AddDescription(char *d
*** 322,341 ****
  
      words->DocumentID(docID);
      
!     char    *w = strtok(desc, " ,\t\r\n");
!     while (w)
!       {
!       if (strlen(w) >= config.Value("minimum_word_length", 3))
!         {
!           String word = w;
!           word.lowercase();
!           word.remove(config["valid_punctuation"]);
!           if (word.length() >= config.Value("minimum_word_length", 3))
!             words->Word(word, 0, 0, config.Double("description_factor"));
!         }
!       w = strtok(0, " ,\t\r\n");
!       }
!     w = '\0';
      // And let's flush the words!
      words->Flush();
      
--- 322,357 ----
  
      words->DocumentID(docID);
      
!     // Parse words, taking care of valid_punctuation.
!     char *p                   = desc;
!     char *valid_punctuation   = config["valid_punctuation"];
!     int   minimum_word_length = config.Value("minimum_word_length", 3);
! 
!     // Not restricted to this size, just used as a hint.
!     String word(MAX_WORD_LENGTH);
! 
!     if (!valid_punctuation)
!       valid_punctuation = "";
! 
!     while (*p)
!     {
!       // Reset contents before adding chars each round.
!       word = 0;
! 
!       while (*p && (isalnum((unsigned char) *p) || strchr(valid_punctuation, *p)))
!         word << *p++;
! 
!       word.remove(valid_punctuation);
! 
!       if (word.length() >= minimum_word_length)
!         // The wordlist takes care of lowercasing; just add it.
!         words->Word(word, 0, 0, config.Double("description_factor"));
! 
!       // No need to count in valid_punctuation for the beginning-char.
!       while (*p && !isalnum((unsigned char) *p))
!         p++;
!     }
! 
      // And let's flush the words!
      words->Flush();
      
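
For anyone who wants to try the new loop outside of htdig, here is a
rough standalone sketch of the same scanning idea.  It assumes
std::string instead of htdig's String, lowercases the word itself
(in the patch the wordlist does that), and new_description_words()
is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical standalone version of the patched loop: a word is a run of
// alphanumerics and valid_punctuation characters; the punctuation is
// dropped from the result, and anything else separates words.
std::vector<std::string> new_description_words(const std::string &desc,
                                               const std::string &valid_punctuation,
                                               std::size_t min_len = 3)
{
    auto is_alnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    auto is_punct = [&](char c) {
        return valid_punctuation.find(c) != std::string::npos;
    };

    std::vector<std::string> out;
    const std::size_t n = desc.size();
    std::size_t i = 0;
    while (i < n) {
        std::string word;
        // Consume one word; skipping valid_punctuation here has the same
        // effect as removing it from the word afterwards.
        while (i < n && (is_alnum(desc[i]) || is_punct(desc[i]))) {
            if (!is_punct(desc[i]))
                word += static_cast<char>(
                    std::tolower(static_cast<unsigned char>(desc[i])));
            ++i;
        }
        if (word.size() >= min_len)
            out.push_back(word);
        // Skip ahead to the next character that can start a word.
        while (i < n && !is_alnum(desc[i]))
            ++i;
    }
    return out;
}
```

With "*" as valid_punctuation, the link text of the example document
gives don, want, supper and said, which are the same forms the text
proper produces.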
brgds, H-P