Florian Gilcher wrote:
> Every fulltext search I use
> has a stopword-list by default. Mysql FULLTEXT for example even needs to
> be recompiled if you want to change them.

This is a massive, massive drawback. For web-apps on shared hosts, in the past I've had to resort to appending characters to each word to evade stop-word and minimum-length filtering, precisely because of this inane default, and you can imagine what that does to performance.
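
For anyone who hasn't run into this, here is a rough sketch of that kind of padding workaround in plain Ruby. It is an illustration only, not the code from any real app: the `PAD` suffix, the `STOP_WORDS` list, the `MIN_LENGTH` value and both helper names are assumptions made up for the example, and `engine_keeps?` is just a stand-in for whatever filtering the search engine applies internally.

```ruby
# Hypothetical suffix appended to every word; any run of word characters
# that is unlikely to occur in real text would do.
PAD = "zzq"

# Stand-ins for the engine's built-in filtering (assumed values).
STOP_WORDS = %w[a an and of the to].freeze
MIN_LENGTH = 4

# Mimics an engine that silently drops stop-words and short tokens.
def engine_keeps?(token)
  token.length >= MIN_LENGTH && !STOP_WORDS.include?(token)
end

# Pads every word so that none of them matches the stop-word list or
# falls under the minimum length. The same transformation has to be
# applied to the indexed text and to every query.
def pad_terms(text)
  text.downcase.scan(/[a-z0-9]+/).map { |word| word + PAD }.join(" ")
end

padded = pad_terms("The cat and the dog")
kept   = padded.split.select { |token| engine_keeps?(token) }
```

Every token gets longer and every query has to be rewritten the same way, which is where the cost comes from.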

> I also want to argue that the
> use of stopwords is very common.

That doesn't make it correct. I see enough queries on this list alone from people surprised by the stop-word behaviour, or needing to change it because they need to support a language other than English, to believe that they should be dropped by default.

> For example, if I have an index of
> 1.000 english documents and search for 'and', chances are high that I
> get a result set of 1000 hits - which is unusable.

So what? The inverse isn't usable either - if 'and' is a stop-word, and you only search for 'and', you'll get no results at all.

> Stopwords are more of a result than an performance optimization.

That's just not the case - stop-words exist primarily to reduce the index size. Their effect on the result set is a product of the way you construct a stop-word list - by picking the words which impart the smallest amounts of information.

> I cannot find it at the moment, but there was the point that 'premature'
> optimization is bad. This may be wise for your own application, but the
> libraries in use should be a) mature and b) optimized.

I believe that point was mine. However, I was not referring to performance - traditionally stop-words have been used as a storage space reduction strategy, with typical results being a reduction in index size of between 20 and 30 percent. There may well be a correlated performance bump, but that's tangential.

I'm not arguing that stop-words should not be available if you want them. I'm not even arguing against supplying a decent set of stop-words for as many different languages as possible. I am trying to argue that they should not be turned on by default.

-- 
Alex
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk

