On Jan 19, 2007, at 8:22 PM, William Morgan wrote:

> Stop words make a lot of sense for the ad-hoc task because they
> eliminate "content-free" words. But I think they don't make nearly as
> much sense for the uses that you and I have for Ferret.
>
> The other big difference, of course, is that disk space is much cheaper
> now than when this stuff was developed.

You've expressed pretty much the reasons why the default  
"PolyAnalyzer" configuration in KinoSearch consists of an  
LCNormalizer, a Tokenizer, and a Stemmer -- no Stopalizer.  See  
<http://www.rectangular.com/downloads/KinoSearch_OSCON2006.pdf> pages  
74-80.
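
To make the shape of such an analyzer chain concrete, here is a minimal
Python sketch. It is not KinoSearch code and does not use its API: the
function names, the regex tokenizer, and the toy suffix-stripping stemmer
are all assumptions made for illustration. The optional stoplist step shows
what leaving out the "Stopalizer" changes.

```python
import re

def lc_normalize(text):
    # Case-fold the input so "Cats" and "cats" index as the same term.
    return text.lower()

def tokenize(text):
    # Hypothetical tokenizer: runs of lowercase letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text)

def stem(tokens):
    # Toy stemmer (an assumption, not a real stemming algorithm):
    # strips one common English suffix from sufficiently long tokens.
    out = []
    for tok in tokens:
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out

def analyze(text, stoplist=None):
    # Default chain: normalize -> tokenize -> stem, with no stoplist.
    tokens = tokenize(lc_normalize(text))
    if stoplist is not None:
        # Optional stoplist step, applied before stemming.
        tokens = [t for t in tokens if t not in stoplist]
    return stem(tokens)
```

Without a stoplist, `analyze("to be or not to be")` keeps all six tokens.
With a stoplist containing those four words, the same phrase analyzes to an
empty list, so it cannot be searched for at all.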

> Unfortunately all I have are opinions. :) I'd be very interested in an
> empirical analysis of just how much bigger the index gets when using
> stopwords (with and without term vectors), and just how much slower
> queries get. I'm guessing that neither will be serious, but I could be
> wrong.

The search-time benefit from using a stoplist can be substantial.   
Search-time costs are dominated by time spent pawing through postings  
for common terms.  Eliminating the most common terms can make a big  
difference.
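
The effect can be sketched with a toy inverted index in Python. This is an
illustration under stated assumptions, not a benchmark: the four-document
corpus, the stoplist, and the helper names `build_index` and
`postings_scanned` are all made up, and the only "cost" counted is the
number of postings a query has to walk.

```python
from collections import defaultdict

def build_index(docs, stoplist=frozenset()):
    # Map each term to the list of doc ids containing it (its postings).
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            if term not in stoplist:
                index[term].append(doc_id)
    return index

def postings_scanned(index, query):
    # Crude cost model: total postings read across all query terms.
    return sum(len(index.get(term, [])) for term in query.lower().split())

# Made-up corpus; "the" appears in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the index of the book",
    "a search engine builds the index",
]

full = build_index(docs)
stopped = build_index(docs, stoplist={"the", "a", "of", "on"})

print(postings_scanned(full, "the index"))     # 6: 4 for "the", 2 for "index"
print(postings_scanned(stopped, "the index"))  # 2: only "index" has postings
```

On a real collection the postings list for a term like "the" is about as
long as the collection itself, so a query that includes it reads far more
data than one made up of rarer terms.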

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
