> Indexing the stems is a good suggestion.  It would
> certainly give faster searching.  If it replaced the
> unstemmed inverted file then it would also save on storage
> requirements, but it would mean we couldn't search on the
> unstemmed version (if that is of concern).

The general strategy used by ht://Dig is post-indexing fuzzy matching. Certainly a Porter stemming fuzzy algorithm would be quite useful. But if we do index stems, I'd say it should definitely be optional. I can think of several instances where I'd want to search on one particular word, and *not* its stemmed variants.

So I'd much rather see work go into innovative fuzzy algorithms. Anyone want to add a real "spelling" fuzzy? What about a Porter endings fuzzy to replace/augment endings?
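
To make "post-indexing fuzzy matching" concrete, here is a rough Python sketch. It is not ht://Dig's actual code: the function names are made up for illustration, and toy_stem is a crude suffix stripper standing in for a real Porter stemmer. The point is only that the index keeps the unstemmed words, the stem table is built afterwards, and stemming stays optional per query.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real Porter stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each unstemmed word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def build_stem_table(index):
    """Post-indexing step: group the words already in the index by stem."""
    table = defaultdict(set)
    for word in index:
        table[toy_stem(word)].add(word)
    return table

def search(index, stem_table, word, use_stems=True):
    """Look up a word, optionally expanded to every indexed word sharing its stem."""
    word = word.lower()
    variants = {word}
    if use_stems:
        variants |= stem_table.get(toy_stem(word), set())
    hits = set()
    for variant in variants:
        hits |= index.get(variant, set())
    return hits
```

With use_stems=False the lookup only touches the exact word, which is the case I'd want to keep available.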

> I have also been wondering if it is possible to turn off
> word-level indexing, to give (much) smaller inverted files
> if people don't need phrase searching.  Does anybody know?

Not at the moment.

But you lose a lot more than phrase searching. You lose field-restricted searching. You lose scoring by proximity (like Google). You lose the ability to score "on the fly"--not to be discounted since many users wonder why they change their scoring factors and the results don't change.
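
A small Python sketch of why word-level data matters. This is an illustration under my own assumptions, not how ht://Dig stores or scores anything: the index layout, function names, and the weight / (1 + distance) formula are all hypothetical. Because positions are kept per word, both phrase matching and a proximity score can be computed at search time, so changing the weight takes effect without reindexing.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Word-level index: word -> doc id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def positions(index, word, doc_id):
    """Positions of a word in a document, without creating index entries."""
    return index.get(word, {}).get(doc_id, [])

def phrase_search(index, phrase):
    """Return ids of documents containing the words of phrase consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in positions(index, w, doc_id)
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

def proximity_score(index, doc_id, a, b, weight=1.0):
    """Score computed at search time: higher when a and b occur closer together."""
    pa = positions(index, a, doc_id)
    pb = positions(index, b, doc_id)
    if not pa or not pb:
        return 0.0
    distance = min(abs(x - y) for x in pa for y in pb)
    return weight / (1 + distance)
```

A document-level index (word -> set of doc ids) has nothing to feed either function, which is the loss I mean.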

If you look at other search products, the basic strategy now is "index everything" and let the search frontend filter if needed. Yes, some even index words like the, and, not, etc.
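
As a sketch of "index everything, filter in the frontend" (again hypothetical Python, with a made-up stop-word list and function names, not any particular product's behaviour): every word goes into the index, and dropping common words becomes a query-time choice.

```python
from collections import defaultdict

# Illustrative stop-word list; a real frontend would make this configurable.
STOP_WORDS = frozenset({"the", "and", "not"})

def build_index(docs):
    """Index every word, including very common ones."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_all(index, query, drop_stop_words=True):
    """Return ids of documents containing all query terms."""
    terms = query.lower().split()
    if drop_stop_words:
        kept = [t for t in terms if t not in STOP_WORDS]
        terms = kept or terms  # keep the query if it is only stop words
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))
```

Because "the" is in the index, a user who really wants it (say, searching for "the who") can still have it by turning the filter off.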

Just my $0.02,
-Geoff

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev