On 8/26/2016 7:13 AM, Steven White wrote:
> But what about the current "default" list that comes with Solr?  How was
> that list, for all supported languages, determined?

That list of stopwords was created from years of history with Lucene,
taking the expertise of many people and the wisdom of the Internet into
account.

> What I fear is this: when someone puts Solr into production, no one makes a
> change to that list, so if the list is not "valid" this will impact
> search, but if the list is valid, how was it determined, just by the
> development team of Solr / Lucene or with input from linguistic experts?

The list of stopwords that comes with Solr is a *starting point*.  The
person who sets Solr up should review the list and adjust it to their
needs ... or possibly remove the stopword filter entirely.

I personally think that stopword removal is more of a problem than a
solution.  In the long-forgotten days of history, when computers had far
less processing power, storage, and memory than they do now ... removing
stopwords was a significant performance advantage, because it made the
indexes smaller.
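
To get a rough feel for that size effect, here is a minimal Python sketch
(not Solr or Lucene code) that counts how many postings a toy inverted
index holds with and without a stopword filter.  The stopword set, the
sample documents, and the names build_index and posting_count are all
made up for illustration.

```python
from collections import defaultdict

# Hypothetical stopword set for illustration; not Solr's shipped list.
STOPWORDS = frozenset({"the", "a", "of", "to", "in", "and", "is"})

# Made-up sample documents.
DOCS = [
    "the history of the band is told in the book",
    "a guide to the music of the sixties",
]


def build_index(docs, stopwords=frozenset()):
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            if term not in stopwords:
                index[term].append((doc_id, pos))
    return index


def posting_count(index):
    """Total number of postings stored across all terms."""
    return sum(len(postings) for postings in index.values())


full = build_index(DOCS)
filtered = build_index(DOCS, STOPWORDS)
print("postings without filter:", posting_count(full))
print("postings with filter:   ", posting_count(filtered))
```

On text like this, most of the postings belong to a handful of very
common words, which is why dropping them mattered so much on older
hardware.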

With typical modern server configurations and small- to medium-sized
indexes, the performance benefit is minimal, and the removal can
sometimes cause significant disadvantages.

The classic example query related to stopwords (in English) is trying to
search for "to be or not to be" -- a phrase made up of words that almost
always appear in a stopword list, so the filter leaves little or nothing to match.  A more relevant
example is searching an entertainment database for "the who".  That
search returns mostly irrelevant results when stopwords are removed. 
Imagine searching a music database for "the the" and not finding
anything at all relating to this band:

https://en.wikipedia.org/wiki/The_The
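
What happens to those queries can be sketched in a few lines of Python.
This is only an illustration of the idea, not how Solr's analysis chain
is implemented: the stopword set below is a small assumed subset of a
typical English list, and analyze is a hypothetical helper.

```python
# Assumed subset of a typical English stopword list, for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "be", "not", "of", "or", "the", "to"})


def analyze(query, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [term for term in query.lower().split() if term not in stopwords]


for q in ("to be or not to be", "the who", "the the", "the rolling stones"):
    print(repr(q), "->", analyze(q))
```

With this list, "to be or not to be" and "the the" analyze to no terms
at all, and "the who" is reduced to the single term "who", which will
match any document containing that word.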

Thanks,
Shawn
