On Tue, 7 Dec 1999, Joe R. Jah wrote:
> > I can't understand why you didn't run into this with htdig 3.1.3 - the
> > problem definitely was there then and in previous releases. Did you
> > add .shtml/ to exclude_urls in the config for 3.1.3, but not 3.1.4?
>
> I can't understand it either. No, I never had .shtml/ in my exclude_urls.
> This brings up an interesting point: Under 3.1.3 a search of the word
> Majordomo in my site would report 28 results; under 3.1.4 without .shtml
> in exclude_urls would report 54, including the SSI mangled URL's.
> Under 3.1.4 with .shtml/ in exclude_urls it reports 47 results;-/
>
> That means 3.1.4 finds 19 more unique files than 3.1.3 in this particular
> search, one of which is the mangled .shtml file;) I jumped to a
> conclusion and reported duplicates in the results;(
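For anyone following along, the workaround above is just the exclude_urls
attribute in the config file, which takes a space-separated list of URL
patterns. A hypothetical fragment (the /cgi-bin/ and .cgi entries are, as I
recall, the shipped defaults; only the trailing .shtml/ pattern is the
addition being discussed):

```
# htdig.conf fragment (hypothetical): skip CGI URLs plus the
# SSI-mangled paths where a .shtml file is followed by a spurious "/"
exclude_urls:		/cgi-bin/ .cgi .shtml/
```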
This is something I'd love to get a handle on. While Loic's test suite is
very nice, I think we need some more tests. One that I propose is a more
macroscopic one: a fixed selection of files that we can set as our 'test
corpus,' where we know *exactly* how many files are available to be indexed
and what the word frequencies are.
What's my suggestion? I've usually used the htdig.org site as a check that
the indexing and searching work acceptably. Since it's currently at about
8,000 pages, I think it could be a decent benchmark. Does it make sense
to "freeze" the site as a tar or something, then use something *else* to
count the pages and run a few test queries?
-Geoff
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.