According to Joe R. Jah:
> On Wed, 8 Dec 1999, Gilles Detillieux wrote:
> > Is it possible that you were getting the extra .shtml/ stuff, but just
> > weren't detecting it in your searches, or are you sure they never came up?
>
> For that particular keyword search I am sure they never came up.
No, what I meant was: are you sure they never came up at all while
running htdig 3.1.3? If your only test of which documents were indexed
is a few htsearch commands, that's not an exhaustive test of what's
been indexed. The implication I was responding to was that 3.1.3 didn't
index these .shtml/ documents, but given what you've told me so far,
I suspect that's not the case.

I do find it interesting that the .shtml/ problem on your site didn't
lead to an infinite hierarchy of bad URLs, as a few other users had
reported previously when running into this SSI problem.
> > Where does the word appear in these 19 extra documents? If it's in img
> > alt text, or immediately after a bare ampersand (&), that would explain
> ^^^^^^^^
> > why htdig 3.1.3 or earlier failed to index that word in these documents.
> > If it appears elsewhere, I'd be very curious to know why htdig 3.1.3
> > missed it, and if it doesn't appear anywhere in the document or in
> > descriptions of hyperlinks to the documents, I'd like to know why htdig
> > 3.1.4 is putting it in the index. Please look into this further, if you
> > can, and get back to me ASAP. We'd like to release 3.1.4 tomorrow, but
> > not if it's putting incorrect entries in the index.
>
> Bingo;) They were all in img alt text, in that particular search.
Phew! Not that I was that worried.
> > Wow, patches to 3.1.4 before it's even released! :)
>
> Yes, and I'd love to add another one to it, the max_keywords attribute I
> requested a month or so ago;)
Well, now you're just getting demanding, aren't you? ;-) I did give it
a vote for 3.2.0b1 back on October 12, but said I wasn't volunteering
for the job. OK, so now I am... (Maybe someone else can adapt and
document it for 3.2.0b1?) And I'm sure Joe will volunteer to test it. :-)

This undocumented and untested patch adds the max_keywords attribute to
htdig. It limits the number of meta tag keywords indexed per document
to the attribute's value; a value of 0 means no limit. This helps
combat meta keyword spamming, but the first n spam keywords in a
document still get indexed, so searches for those words will still
pull up the spamming documents.
--- htcommon/defaults.cc.orig Mon Dec 6 16:14:04 1999
+++ htcommon/defaults.cc Wed Dec 8 11:36:27 1999
@@ -88,6 +88,7 @@ ConfigDefaults defaults[] =
{"max_doc_size", "100000"},
{"max_head_length", "512"},
{"max_hop_count", "999999"},
+ {"max_keywords", "0"},
{"max_meta_description_length", "512"},
{"max_prefix_matches", "1000"},
{"max_stars", "4"},
--- htdig/HTML.cc.orig Fri Dec 3 11:03:04 1999
+++ htdig/HTML.cc Wed Dec 8 11:44:54 1999
@@ -27,6 +27,8 @@ static StringMatch attrs;
static StringMatch srcMatch;
static StringMatch hrefMatch;
static StringMatch keywordsMatch;
+static int keywordsCount;
+static int max_keywords;
static int offset;
static int totlength;
@@ -98,6 +100,9 @@ HTML::HTML()
keywordsMatch.IgnoreCase();
keywordsMatch.Pattern(keywordNames.Join('|'));
keywordNames.Release();
+ max_keywords = config.Value("max_keywords", 0);
+ if (max_keywords == 0)
+ max_keywords = (int) ((unsigned int) ~1 >> 1);
word = 0;
href = 0;
@@ -150,6 +155,7 @@ HTML::parse(Retriever &retriever, URL &b
static char *skip_start = config["noindex_start"];
static char *skip_end = config["noindex_end"];
+ keywordsCount = 0;
offset = 0;
title = 0;
head = 0;
@@ -792,7 +798,8 @@ HTML::do_tag(Retriever &retriever, Strin
char *w = HtWordToken(transSGML(keywords));
while (w && doindex)
{
- if (strlen(w) >= minimumWordLength)
+ if (strlen(w) >= minimumWordLength
+ && ++keywordsCount <= max_keywords)
retriever.got_word(w, 1, 10);
w = HtWordToken(0);
}
@@ -875,7 +882,8 @@ HTML::do_tag(Retriever &retriever, Strin
char *w = HtWordToken(transSGML(conf["content"]));
while (w && doindex)
{
- if (strlen(w) >= minimumWordLength)
+ if (strlen(w) >= minimumWordLength
+ && ++keywordsCount <= max_keywords)
retriever.got_word(w, 1, 10);
w = HtWordToken(0);
}
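
For anyone who wants to try the counting logic outside of htdig, here
is a rough standalone sketch of what the patch is meant to do. It is
not htdig code: KeywordLimiter and collectKeywords are hypothetical
names, and the comma/whitespace splitting is only a simplified stand-in
for HtWordToken. As in the patch, the count is per document, only words
that meet the minimum length use up the limit, and 0 means no limit.

```cpp
#include <cctype>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-document state kept in HTML.cc.
struct KeywordLimiter {
    int maxKeywords;  // effective limit; INT_MAX when configured as 0
    int count = 0;    // keywords accepted so far in the current document

    explicit KeywordLimiter(int configured)
        : maxKeywords(configured == 0 ? INT_MAX : configured) {}

    // Call once per document, like the reset at the top of HTML::parse().
    void startDocument() { count = 0; }

    // True if the word should be indexed. Words that are too short are
    // rejected without counting against the limit.
    bool accept(const std::string &word, std::size_t minimumWordLength) {
        if (word.length() < minimumWordLength)
            return false;
        if (count >= maxKeywords)
            return false;
        ++count;
        return true;
    }
};

// Splits meta tag content on commas and whitespace (a simplification)
// and returns the words that would be passed on for indexing.
std::vector<std::string> collectKeywords(KeywordLimiter &limiter,
                                         const std::string &content,
                                         std::size_t minimumWordLength) {
    std::vector<std::string> indexed;
    std::string word;
    for (std::size_t i = 0; i <= content.size(); ++i) {
        // Treat the end of the string as one final separator.
        char c = i < content.size() ? content[i] : ' ';
        bool isSeparator =
            c == ',' || std::isspace(static_cast<unsigned char>(c));
        if (!isSeparator) {
            word += c;
            continue;
        }
        if (!word.empty() && limiter.accept(word, minimumWordLength))
            indexed.push_back(word);
        word.clear();
    }
    return indexed;
}
```

With a limit of 2 and a minimum word length of 3, the content
"cheap, a, pills, casino" would yield only "cheap" and "pills", and a
second keywords tag in the same document would add nothing further.
Assuming the usual "name: value" form of the config file, the limit
would be set with a line like "max_keywords: 2".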
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930