According to Geoff Hutchison:
> On Tue, 7 Dec 1999, Joe R. Jah wrote:
> > > I can't understand why you didn't run into this with htdig 3.1.3 - the
> > > problem definitely was there then and in previous releases.  Did you
> > > add .shtml/ to exclude_urls in the config for 3.1.3, but not 3.1.4?
> > 
> > I can't understand it either.  No I never had .shtml/ in my exclude_urls. 
> > This brings up an interesting point: Under 3.1.3 a search of the word
> > Majordomo in my site would report 28 results; under 3.1.4 without .shtml
> > in exclude_urls it would report 54, including the SSI-mangled URLs.
> > Under 3.1.4 with .shtml/ in exclude_urls it reports 47 results;-/
> > 
> > That means 3.1.4 finds 19 more unique files than 3.1.3 in this particular
> > search, one of which is the mangled .shtml file;)  I jumped to a
> > conclusion and reported duplicates in the results;(
> 
> This is something I'd love to get a handle on. While Loic's test suite is
> very nice, I think we need some more tests. One that I propose is a more
> macroscopic selection of files that we can set as our 'test corpus,' know
> *exactly* how many files are available to be indexed and the word
> frequencies.

I'd like to get a handle on this too.  However, there may be legitimate
reasons for word frequencies to change as a result of fixes/enhancements
to the parser, as is the case with the two changes in 3.1.4, namely
the bare ampersand bug fix, and the handling of img alt text, which
can both lead to increased word frequencies.  Another fix in 3.1.4,
to avoid indexing meta keywords or descriptions while under a "noindex"
condition, would instead decrease the word frequencies.  Any test
suite would have to allow for changes in either direction.
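
A minimal sketch of how a test could allow for such changes, assuming the
counts from an earlier release and from the current build are available as
plain word-to-count mappings.  The function name compare_counts, the
"up"/"down" labels and the "htdig" counts are hypothetical, invented for
illustration; only the Majordomo figures (28 and 47) come from this thread.

```python
def compare_counts(baseline, current, expected_changes=None):
    """Return {word: (old, new)} for counts that changed unexpectedly.

    expected_changes maps a word to "up" or "down", the direction in which
    a known parser fix is expected to move its count (hypothetical format).
    """
    expected_changes = expected_changes or {}
    unexplained = {}
    for word in sorted(set(baseline) | set(current)):
        old = baseline.get(word, 0)
        new = current.get(word, 0)
        if new == old:
            continue
        direction = "up" if new > old else "down"
        # A change is acceptable only if it moves in the expected direction.
        if expected_changes.get(word) != direction:
            unexplained[word] = (old, new)
    return unexplained


# Illustrative data: the "htdig" counts are made up.
baseline = {"majordomo": 28, "htdig": 100}
current = {"majordomo": 47, "htdig": 90}
expected = {"majordomo": "up"}  # e.g. img alt text is now indexed
print(compare_counts(baseline, current, expected))  # {'htdig': (100, 90)}
```

Anything left in the result would need a human to decide whether it is a
regression or another legitimate change that should be added to the
expected list.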

> What's my suggestion? I've usually used the htdig.org site as a check that
> the indexing and searching works acceptably. Since it's currently at about
> 8,000 pages, I think it could be a decent benchmark. Does it make sense
> to "freeze" the site as a tar or something, then use something *else* to
> count pages, and a few test queries?

I think a maindocs snapshot would make a good test suite.  By something
else, I assume you mean something like find and grep.  It would also
make sense to have a snapshot of earlier htdig/htsearch output to compare
against, but again, you'd have to account for legitimate changes.
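
As a rough sketch of the "something else" idea, the following counts the
files in an unpacked snapshot and, for each word, the number of files that
contain it, much like find piped through grep -l.  The function name
survey_corpus, the extension list and the word pattern are assumptions made
for illustration.  It does no HTML parsing, so tags and attribute text are
counted too, and its numbers will not match htdig's exactly; it only gives
an independent baseline to compare against.

```python
import os
import re
from collections import Counter

# Assumed definition of a "word": a run of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def survey_corpus(root, extensions=(".html", ".htm")):
    """Return (file_count, doc_freq) for matching files under root.

    doc_freq[word] is the number of files in which word occurs at least once.
    """
    file_count = 0
    doc_freq = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            file_count += 1
            path = os.path.join(dirpath, name)
            # latin-1 decodes any byte sequence, so odd files cannot abort the run.
            with open(path, encoding="latin-1") as fh:
                text = fh.read().lower()
            # Use a set so each file contributes at most 1 per word.
            doc_freq.update(set(WORD_RE.findall(text)))
    return file_count, doc_freq
```

Calling survey_corpus on the directory where the snapshot was unpacked and
looking up doc_freq["majordomo"] would give a figure to set beside the
number of htsearch results for the same word.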

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
