On Mar 31, 2007, at 10:41 AM, Andreas Korth wrote:

> @David: You should probably consider changing StandardAnalyzer not to
> use stop words by default. It confuses people because no one would
> suspect such a feature to be enabled by default. It just doesn't
> follow the principle of least astonishment.
>
> Even if people want to use stop words, they might not be happy with
> the ones built into Ferret. It very much depends on the nature of the
> content that is indexed and instead of using a one-size-fit-all stop
> word list one is usually better off with compiling a custom one for
> any particular application.

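To make the surprise concrete, here is a minimal sketch in plain Python.
It does not use Ferret's API: the stop list, `analyze`, and
`phrase_search` are hypothetical stand-ins, and the stop list is an
assumption for illustration, not the one built into Ferret. It only
shows how a phrase query made entirely of stop words can come back
empty when a stop list is applied at both index and query time.

```python
# Hypothetical stop list for illustration; NOT the list built into Ferret.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to", "who"})


def analyze(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


def phrase_search(docs, query, stop_words=frozenset()):
    """Return docs whose token stream contains the query tokens in order."""
    q = analyze(query, stop_words)
    if not q:
        # Every query term was a stop word, so nothing is left to match.
        return []
    hits = []
    for doc in docs:
        toks = analyze(doc, stop_words)
        # Slide a window of len(q) tokens over the document.
        if any(toks[i:i + len(q)] == q
               for i in range(len(toks) - len(q) + 1)):
            hits.append(doc)
    return hits


docs = ["The Who played Leeds in 1970", "Who is the drummer"]

with_stops = phrase_search(docs, "the who", STOP_WORDS)
without_stops = phrase_search(docs, "the who")
```

With the stop list enabled the query is analyzed down to nothing, so no
document can match; with it disabled, only the document containing the
two words side by side is returned.
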
I concur. Ferret's StandardAnalyzer is based upon Lucene's class of the
same name, so some parallelism would be lost, but I think omitting stop
lists is better nonetheless.

Omitting stop lists by default does have performance and disk-space
costs, since very common terms are then indexed and searched like any
other. However, disk space is cheap, Ferret is fast, and search results
are slightly better when you avoid stop lists (e.g. searching for
'"the who"' actually returns something). Users with large deployments
can still trade away some IR precision for increased performance by
enabling stop lists if they so choose.

KinoSearch doesn't have a StandardAnalyzer; a class called PolyAnalyzer
fills that role. By default, it performs lowercasing, tokenizing and
stemming -- but no stopalizing.

<http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Analysis/PolyAnalyzer.html>

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk

