On Jan 19, 2007, at 8:22 PM, William Morgan wrote:

> Stop words make a lot of sense for the ad-hoc task because they
> eliminate "content-free" words. But I think they don't make nearly as
> much sense for the uses that you and I have for Ferret.
>
> The other big difference, of course, is that disk space is much cheaper
> now than when this stuff was developed.

You've expressed pretty much the reasons why the default "PolyAnalyzer"
configuration in KinoSearch consists of an LCNormalizer, a Tokenizer, and
a Stemmer -- no Stopalizer. See
<http://www.rectangular.com/downloads/KinoSearch_OSCON2006.pdf>, pages
74-80.

> Unfortunately all I have are opinions. :) I'd be very interested in an
> empirical analysis of just how much bigger the index gets when using
> stopwords (with and without term vectors), and just how much slower
> queries get. I'm guessing that neither will be serious, but I could be
> wrong.

The search-time benefit from using a stoplist can be substantial.
Search-time costs are dominated by time spent pawing through postings for
common terms. Eliminating the most common terms can make a big
difference.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
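
To make the point about postings for common terms concrete, here is a minimal sketch in Python. It is not Ferret or KinoSearch code: the toy corpus, the stoplist, and the helper names build_index and postings_scanned are all hypothetical, and the "search" is a naive count of how many postings would have to be read for each query term.

```python
from collections import defaultdict

# Hypothetical toy corpus and stoplist, for illustration only.
DOCS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the index of the book",
]
STOPLIST = frozenset({"the", "a", "of", "on", "in"})


def build_index(docs, stoplist=frozenset()):
    """Map each term to its list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            if term not in stoplist:
                index[term].append((doc_id, pos))
    return index


def postings_scanned(index, query):
    """Count the postings a naive search reads across all query terms."""
    # .get() avoids creating empty entries for terms missing from the index.
    return sum(len(index.get(term, ())) for term in query.lower().split())


full = build_index(DOCS)               # no stoplist: every term is indexed
stopped = build_index(DOCS, STOPLIST)  # stoplist applied at index time

query = "the cat"
print("without stoplist:", postings_scanned(full, query))
print("with stoplist:   ", postings_scanned(stopped, query))
```

Even in this tiny corpus, the single common term "the" accounts for most of the postings read for the query; on a real collection the gap would depend on the term distribution, which is the kind of thing an empirical test would have to measure.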

