On 8/26/2016 7:13 AM, Steven White wrote:
> But what about the current "default" list that comes with Solr? How was
> that list, for all supported languages, determined?

That list of stopwords was created from years of history with Lucene,
taking the expertise of many people and the wisdom of the Internet into
account.

> What I fear is this, when someone puts Solr into production, no one makes a
> change to that list, so if the list is not "valid" this will impacting
> search, but if the list is valid, how was it determined, just by the
> development team of Solr / Lucene or input from linguistic expert?

The list of stopwords that comes with Solr is a *starting point*. The
person who sets Solr up should review the list and adjust it to their
needs ... or possibly remove the stopword filter entirely.

I personally think that stopword removal is more of a problem than a
solution. In the long-forgotten days of history, when computers had far
less processing power, storage, and memory than they do now ... removing
stopwords was a significant performance advantage, because it made the
indexes smaller. With typical modern server configurations and small to
medium-sized indexes, the performance benefit is minimal, and the removal
can sometimes cause significant disadvantages.

The classic example query related to stopwords (in English) is trying to
search for "to be or not to be" -- a phrase made up entirely of words
that almost always appear in a stopword list, so little or nothing is
left to search on once they are removed.

A more relevant example is searching an entertainment database for "the
who". That search returns mostly irrelevant results when stopwords are
removed. Imagine searching a music database for "the the" and not
finding anything at all relating to this band:

https://en.wikipedia.org/wiki/The_The

Thanks,
Shawn
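
To make the example queries above concrete, here is a minimal sketch in plain Python of what a stopword filter does to them. It is not Solr or Lucene code: the `STOPWORDS` set is a small hypothetical list chosen for illustration (not the list that ships with Solr), and `analyze` is a made-up helper that only lowercases, splits on whitespace, and drops stopwords.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
# This is NOT the list that ships with Solr; review the real one yourself.
STOPWORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}


def analyze(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


queries = ["to be or not to be", "the who", "the the", "the rolling stones"]
for query in queries:
    # An empty token list means nothing is left to search for.
    print(f"{query!r} -> {analyze(query)}")
```

With this list, "to be or not to be" and "the the" are reduced to no tokens at all, and "the who" is reduced to the single common word "who", while "the rolling stones" still keeps its two distinctive terms. A real analysis chain does more than this (tokenization rules, position handling, and so on), so treat the sketch only as an illustration of why the filter can hurt.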