There's a bug (gasp! but at least it's harmless) that causes
non-word characters to be included in the database.
Non-word characters are painstakingly removed from words in the
text proper (see HTML.cc), but kept in descriptions for obvious
reasons. However, when the words from a description are
weighted and added to the list of words to index, these
characters go in for free (and I don't mean the
"valid_punctuation" characters).
Since htsearch strips the invalid characters from the query
before searching, indexed words that contain them can never
match. They just take up room in the database, and no-one will
ever see them.
Here's an example, indexing the single document:
<html>
<head><title>gross</title></head>
<body>
<a href="nosupper.html">"I don't want no supper!" he said.</a>
</body>
</html>
With valid_punctuation set to just the asterisk:
valid_punctuation: *
and max_description_length at its default value (it matters here):
max_description_length: 60
the wordlist comes out like this before merging (which, as you
may know, just sorts the words):
said. l:0 i:1 w:150000
supper!" l:0 i:1 w:150000
don't l:0 i:1 w:150000
want l:0 i:1 w:150000
don l:587 i:0 w:413
supper l:698 i:0 w:302
gross l:158 i:0 w:84200
said l:793 i:0 w:207
want l:634 i:0 w:366
(BTW, the words are counted twice; once for being in the text,
and once for being in a description. I believe it's ok, it's
supposed to be like that.)
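To see where the first four entries above come from, here is a
minimal standalone sketch of the current splitting. It is not the
ht://Dig code itself: it assumes std::string in place of the
String class, takes the configuration values as plain parameters,
and the name old_description_words is made up for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the current AddDescription() loop:
// split on a fixed delimiter set, lowercase, and strip only the
// valid_punctuation characters. Everything else stays in the word.
std::vector<std::string> old_description_words(
    std::string desc,                      // by value: strtok writes into it
    const std::string &valid_punctuation,
    std::size_t minimum_word_length = 3)   // assumed default, as in the diff
{
    std::vector<std::string> out;
    const char *delims = " ,\t\r\n";

    for (char *w = std::strtok(&desc[0], delims); w != nullptr;
         w = std::strtok(nullptr, delims))
    {
        if (std::strlen(w) < minimum_word_length)
            continue;

        std::string word(w);
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });

        // Drop valid_punctuation characters only; quotes, periods
        // and other non-word characters are left untouched.
        word.erase(std::remove_if(word.begin(), word.end(),
                                  [&](char c) {
                                      return valid_punctuation.find(c) !=
                                             std::string::npos;
                                  }),
                   word.end());

        if (word.length() >= minimum_word_length)
            out.push_back(word);
    }
    return out;
}
```

Called with the link text from the example document and "*" as
valid_punctuation, this yields don't, want, supper!" and said.,
which are the four description entries in the list (before
sorting).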
Here's a patch to handle characters in description-words just
like for other words:
Wed Jan 6 05:53:02 1999 Hans-Peter Nilsson <[EMAIL PROTECTED]>
* htcommon/DocumentRef.cc (AddDescription): Do not add non-word
characters to the wordlist.
Index: DocumentRef.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htcommon/DocumentRef.cc,v
retrieving revision 1.11
diff -p -c -r1.11 DocumentRef.cc
*** DocumentRef.cc 1999/01/05 19:35:42 1.11
--- DocumentRef.cc 1999/01/06 04:46:45
*************** void DocumentRef::AddDescription(char *d
*** 322,341 ****
words->DocumentID(docID);
! char *w = strtok(desc, " ,\t\r\n");
! while (w)
! {
! if (strlen(w) >= config.Value("minimum_word_length", 3))
! {
! String word = w;
! word.lowercase();
! word.remove(config["valid_punctuation"]);
! if (word.length() >= config.Value("minimum_word_length", 3))
! words->Word(word, 0, 0, config.Double("description_factor"));
! }
! w = strtok(0, " ,\t\r\n");
! }
! w = '\0';
// And let's flush the words!
words->Flush();
--- 322,357 ----
words->DocumentID(docID);
! // Parse words, taking care of valid_punctuation.
! char *p = desc;
! char *valid_punctuation = config["valid_punctuation"];
! int minimum_word_length = config.Value("minimum_word_length", 3);
!
! // Not restricted to this size, just used as a hint.
! String word(MAX_WORD_LENGTH);
!
! if (!valid_punctuation)
! valid_punctuation = "";
!
! while (*p)
! {
! // Reset contents before adding chars each round.
! word = 0;
!
! while (*p && (isalnum(*p) || strchr(valid_punctuation, *p)))
! word << *p++;
!
! word.remove(valid_punctuation);
!
! if (word.length() >= minimum_word_length)
! // The wordlist takes care of lowercasing; just add it.
! words->Word(word, 0, 0, config.Double("description_factor"));
!
! // No need to count in valid_punctuation for the beginning-char.
! while (*p && !isalnum(*p))
! p++;
! }
!
// And let's flush the words!
words->Flush();
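For anyone who wants to try the new loop outside ht://Dig, here is
a rough standalone sketch of the same scan. It assumes std::string
instead of the String class and plain parameters instead of the
config object; the name description_words is hypothetical. Unlike
the real wordlist it does not lowercase anything.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical standalone version of the patched scan: a word is a
// run of alphanumerics and valid_punctuation characters, and the
// valid_punctuation characters are removed before the length check.
std::vector<std::string> description_words(
    const std::string &desc,
    const std::string &valid_punctuation,
    std::size_t minimum_word_length = 3)   // assumed default, as in the diff
{
    auto is_alnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    auto is_valid_punct = [&](char c) {
        return valid_punctuation.find(c) != std::string::npos;
    };

    std::vector<std::string> out;
    std::size_t i = 0;
    const std::size_t n = desc.size();

    while (i < n)
    {
        // Collect one run of word characters.
        std::string word;
        while (i < n && (is_alnum(desc[i]) || is_valid_punct(desc[i])))
            word += desc[i++];

        // Same effect as word.remove(valid_punctuation) in the patch.
        word.erase(std::remove_if(word.begin(), word.end(), is_valid_punct),
                   word.end());

        if (word.length() >= minimum_word_length)
            out.push_back(word);

        // Skip to the next alphanumeric; a leading valid_punctuation
        // character would be removed anyway, so it can be skipped too.
        while (i < n && !is_alnum(desc[i]))
            ++i;
    }
    return out;
}
```

With the link text from the example and "*" as valid_punctuation
this gives don, want, supper and said, the same words the text
proper produces for that sentence.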
brgds, H-P