Florian Gilcher wrote:
> Every fulltext search I use
> has a stopword-list by default. Mysql FULLTEXT for example even needs to
> be recompiled if you want to change them. 
This is a massive, massive drawback.  For web-apps on shared hosts, 
where recompiling MySQL isn't an option, I've had to resort to 
appending characters to each word to evade stop-word and minimum-length 
filtering, precisely because of this inane default, and you can imagine 
what that does to performance.
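
For anyone who hasn't seen that workaround, here is a rough sketch of 
the idea in Python.  The suffix and the function name are made up for 
illustration; all it assumes is that the search engine skips words on a 
built-in stop-word list and words below some minimum length.

```python
import re

# Hypothetical marker: any run of word characters that never occurs in
# the real text will do.
PAD = "zqx"

def pad_words(text, pad=PAD):
    """Lower-case the text and append `pad` to every word, so that no
    token matches a built-in stop-word or falls under a minimum length."""
    return re.sub(r"\w+", lambda m: m.group(0) + pad, text.lower())

# The same transformation has to be applied both to the text that gets
# indexed and to every query string.
indexed = pad_words("To be or not to be")
query = pad_words("to be")
```

Every token gets longer, and you typically end up storing a padded copy 
of the text next to the original, so the cost is easy to see.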

> I also want to argue that the
> use of stopwords is very common. 
That doesn't make it correct.  I see enough queries on this list alone, 
from people surprised by the stop-word behaviour or needing to change 
it to support a language other than English, to believe that stop-word 
filtering should be switched off by default.

> For example, if I have an index of
> 1.000 english documents and search for 'and', chances are high that I
> get a result set of 1000 hits - which is unusable. 
So what?  The inverse isn't usable either - if 'and' is a stop-word, and 
you only search for 'and', you'll get no results at all.
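
To make the two cases concrete, here is a toy inverted index in Python. 
It is only a sketch: `build_index` and `search` are hypothetical names, 
not Ferret's API, and filtering is opt-in because the default stop-word 
set is empty.

```python
from collections import defaultdict

def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of ids of the documents containing it.
    Filtering is opt-in: the default stop-word set is empty."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return index

def search(index, term, stop_words=frozenset()):
    """Return the sorted ids of documents matching a one-term query."""
    term = term.lower()
    if term in stop_words:
        return []  # the query term itself was filtered away
    return sorted(index.get(term, ()))

docs = ["salt and pepper", "rock and roll", "black and white"]

unfiltered = build_index(docs)
filtered = build_index(docs, stop_words={"and"})

print(search(unfiltered, "and"))                    # every document
print(search(filtered, "and", stop_words={"and"}))  # nothing at all
```

Neither answer is much use to someone who really did type just 'and', 
but only the unfiltered index can still answer a query such as a phrase 
that happens to contain it.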

> Stopwords are more of a result than an performance optimization.
That's just not the case - stop-words exist primarily to reduce the 
index size.  Their effect on the result set is a product of the way you 
construct a stop-word list - by picking the words which impart the 
smallest amounts of information.
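
As a rough illustration of both halves of that, the Python sketch below 
picks stop-words by document frequency (used here as a crude stand-in 
for "imparts the least information", which is my assumption, not a 
standard definition) and counts how many postings an index would hold 
with and without them.  All names are hypothetical and the corpus is a 
toy, so the numbers say nothing about real-world savings.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df

def pick_stop_words(docs, n):
    """Take the n terms with the highest document frequency, i.e. the
    ones that do least to tell documents apart.  Ties are broken
    alphabetically so the result is deterministic."""
    df = document_frequencies(docs)
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:n])

def posting_count(docs, stop_words=frozenset()):
    """Number of (term, document) pairs an inverted index would store."""
    return sum(len(set(text.lower().split()) - set(stop_words))
               for text in docs)

docs = [
    "the cat and the dog",
    "the dog and the bone",
    "the bird sang",
]
stop_words = pick_stop_words(docs, 2)
before = posting_count(docs)
after = posting_count(docs, stop_words)
```

The words that save the most space are exactly the ones that appear 
almost everywhere, which is why dropping them also changes what a query 
can match.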

> I cannot find it at the moment, but there was the point that 'premature'
> optimization is bad. This may be wise for your own application, but the
> libraries in use should be a) mature and b) optimized.
I believe that point was mine.  However, I was not referring to 
performance - traditionally stop-words have been used as a storage space 
reduction strategy, with typical results being a reduction in index size 
of between 20 and 30 percent.  There may well be a correlated 
performance bump, but that's tangential.

I'm not arguing that stop-words should not be available if you want 
them.  I'm not even arguing against supplying a decent set of stop-words 
for as many different languages as possible.  I am trying to argue that 
they should not be turned on by default.

-- 
Alex
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
