According to Joe R. Jah:
> On Wed, 8 Dec 1999, Gilles Detillieux wrote:
> > Is it possible that you were getting the extra .shtml/ stuff, but just
> > weren't detecting it in your searches, or are you sure they never came up?
> 
> For that particular keyword search I am sure they never came up.

No, what I meant was: are you sure they never came up at all while
running htdig 3.1.3?  If your only test for what documents were indexed
is a few htsearch commands, that's not an exhaustive test of what's
been indexed.  The implication I was responding to was that 3.1.3 didn't
index these .shtml/ documents, but given what you've told me so far, I
suspect that's not the case.

I do find it interesting that the .shtml/ problem on your site didn't
lead to an infinite hierarchy of bad URLs, as a few other users had
reported previously when running into this SSI problem.

> > Where does the word appear in these 19 extra documents?  If it's in img
> > alt text, or immediately after a bare ampersand (&), that would explain
>   ^^^^^^^^
> > why htdig 3.1.3 or earlier failed to index that word in these documents.
> > If it appears elsewhere, I'd be very curious to know why htdig 3.1.3
> > missed it, and if it doesn't appear anywhere in the document or in
> > descriptions of hyperlinks to the documents, I'd like to know why htdig
> > 3.1.4 is putting it in the index.  Please look into this further, if you
> > can, and get back to me ASAP.  We'd like to release 3.1.4 tomorrow, but
> > not if it's putting incorrect entries in the index.
> 
> Bingo;)  They were all in img alt text, in that particular search.

Phew!  Not that I was that worried.

> > Wow, patches to 3.1.4 before it's even released!  :)
> 
> Yes, and I'd love to add another one to it, the max_keywords attribute I
> requested a month or so ago;)

Well, now you're just getting demanding, aren't you?  ;-)  I did give it
a vote for 3.2.0b1 back on October 12, but said I'm not volunteering for
the job.  OK, so now I am...  (Maybe someone else can adapt and document
it for 3.2.0b1?)  And I'm sure Joe will volunteer to test it.  :-)

This undocumented and untested patch adds the max_keywords attribute to
htdig.  It limits the number of meta tag keywords indexed per document
to the attribute's value; a value of 0 means no limit.  This helps
combat meta keyword spamming, but the first n spam keywords in a
document still get indexed, so searches for those words will still pull
up the spamming documents.
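
For anyone who wants to see the counting rule on its own before reading
the diff, here is a minimal standalone sketch of it.  It is not htdig
code: limit_keywords and its parameters are hypothetical names made up
for illustration, and it assumes the keywords have already been split
into words.  It tests the count before incrementing rather than using
++count <= max as the patch does, which should behave the same for any
realistic number of keywords.

```cpp
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-document limit; not part of htdig.
// Returns the meta keywords that would be handed on for indexing.
std::vector<std::string> limit_keywords(const std::vector<std::string> &words,
                                        int max_keywords,
                                        std::size_t minimum_word_length)
{
    // As in the patch, a value of 0 means "no limit".
    if (max_keywords == 0)
        max_keywords = INT_MAX;

    std::vector<std::string> kept;
    int keywords_count = 0;     // starts at zero for each document

    for (const std::string &w : words)
    {
        // Words that are too short are skipped and do not count
        // against the limit.
        if (w.length() < minimum_word_length)
            continue;

        // Keep only the first max_keywords qualifying words.
        if (keywords_count < max_keywords)
        {
            ++keywords_count;
            kept.push_back(w);
        }
    }
    return kept;
}
```

With a limit of 2 and a minimum word length of 3, the word list
"an cheap pills casino loans" would keep only "cheap" and "pills".  In
the patch itself the counter is shared by both meta keyword code paths
in do_tag and is reset at the start of each parse.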

--- htcommon/defaults.cc.orig   Mon Dec  6 16:14:04 1999
+++ htcommon/defaults.cc        Wed Dec  8 11:36:27 1999
@@ -88,6 +88,7 @@ ConfigDefaults        defaults[] =
     {"max_doc_size",                   "100000"},
     {"max_head_length",                        "512"},
     {"max_hop_count",                  "999999"},
+    {"max_keywords",                   "0"},
     {"max_meta_description_length",     "512"},
     {"max_prefix_matches",             "1000"},
     {"max_stars",                      "4"},
--- htdig/HTML.cc.orig  Fri Dec  3 11:03:04 1999
+++ htdig/HTML.cc       Wed Dec  8 11:44:54 1999
@@ -27,6 +27,8 @@ static StringMatch    attrs;
 static StringMatch     srcMatch;
 static StringMatch     hrefMatch;
 static StringMatch     keywordsMatch;
+static int             keywordsCount;
+static int             max_keywords;
 static int             offset;
 static int             totlength;
 
@@ -98,6 +100,9 @@ HTML::HTML()
     keywordsMatch.IgnoreCase();
     keywordsMatch.Pattern(keywordNames.Join('|'));
     keywordNames.Release();
+    max_keywords = config.Value("max_keywords", 0);
+    if (max_keywords == 0)
+       max_keywords = (int) ((unsigned int) ~1 >> 1);
     
     word = 0;
     href = 0;
@@ -150,6 +155,7 @@ HTML::parse(Retriever &retriever, URL &b
     static char         *skip_start = config["noindex_start"];
     static char         *skip_end = config["noindex_end"];
 
+    keywordsCount = 0;
     offset = 0;
     title = 0;
     head = 0;
@@ -792,7 +798,8 @@ HTML::do_tag(Retriever &retriever, Strin
                char    *w = HtWordToken(transSGML(keywords));
                while (w && doindex)
                {
-                   if (strlen(w) >= minimumWordLength)
+                   if (strlen(w) >= minimumWordLength
+                               && ++keywordsCount <= max_keywords)
                      retriever.got_word(w, 1, 10);
                    w = HtWordToken(0);
                }
@@ -875,7 +882,8 @@ HTML::do_tag(Retriever &retriever, Strin
                    char        *w = HtWordToken(transSGML(conf["content"]));
                    while (w && doindex)
                    {
-                       if (strlen(w) >= minimumWordLength)
+                       if (strlen(w) >= minimumWordLength
+                               && ++keywordsCount <= max_keywords)
                          retriever.got_word(w, 1, 10);
                        w = HtWordToken(0);
                    }

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

