I would partially agree with Walter - having more resources allows us to
include stopwords in the index and let the scoring model do its job. However,
there are other Solr features that can suffer from that approach: e.g.
if you use edismax and mm=80%, then for a query with stopwords you can
end up wi
I recommend that you remove StopFilterFactory from every analysis chain.
In the tf.idf scoring model, rare words are automatically weighted more than
common words.
I have an index with 11.6 million documents. “the” occurs in 9.9 million of
those documents. “cat” occurs in 16,000 of those documen
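
To put rough numbers on the counts quoted above, here is a minimal sketch
using the textbook idf formula, ln(N / df). This is an assumption made for
illustration: the similarity class Lucene actually uses adds smoothing terms,
so the exact values in Solr will differ, but the ordering will not.

```python
import math

# Counts taken from the message above.
TOTAL_DOCS = 11_600_000
DOC_FREQ = {"the": 9_900_000, "cat": 16_000}


def idf(total_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency: ln(N / df).

    Not Lucene's exact formula; used here only to show the scale
    of the difference between a common and a rare term.
    """
    return math.log(total_docs / doc_freq)


for term, df in DOC_FREQ.items():
    print(f"{term}: idf = {idf(TOTAL_DOCS, df):.2f}")
```

With these counts the textbook formula gives roughly 0.16 for "the" and
roughly 6.59 for "cat", so the rare word carries about forty times the weight
without any stopword list being involved.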
Hi Walter and all. Sorry for the late reply, I was out of town.
Are you saying the list of stop words should be removed from the stop word
file? I understand the issues I will run into because of the stop word list,
but all along, my understanding of the stop word list being in the stop word
file
is -- to e
Do not remove stop words. Want to search for “vitamin a”? That won’t work.
Stop word removal is a hack left over from when we were running search engines
in 64 kbytes of memory.
Yes, common words are less important for search, but removing them is a brute
force approach with severe side effects
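
The "vitamin a" case can be sketched in a few lines. This is a toy analyzer,
not Solr's analysis chain: the `analyze` function and the stopword set are
hypothetical and only mimic what a lowercase-plus-stop-filter chain does to
the query tokens.

```python
# Hypothetical stopword set, for illustration only; not Solr's default list.
STOPWORDS = frozenset({"a", "an", "the", "of", "to", "in"})


def analyze(text: str, stopwords: frozenset = STOPWORDS) -> list[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


for query in ("vitamin a", "vitamin c"):
    print(f"{query!r} -> {analyze(query)}")
```

After analysis, "vitamin a" is reduced to the single token "vitamin", so it
can no longer be told apart from a search for any other vitamin, while
"vitamin c" keeps both tokens.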
Thanks Shawn. This is the best answer I have seen, much appreciated.
A follow-up question: I want to remove stop words from the list, but if I
do, then search quality will degrade (and index size will grow, which is less
of an issue). For example, if I remove "a", then if someone searches for "For
a F
On 8/27/2016 12:39 PM, Shawn Heisey wrote:
> I personally think that stopword removal is more of a problem than a
> solution.
There actually is one thing that a stopword filter can do that has little
to do with the purpose it was designed for. You can make it impossible
to search for certain words
On 8/26/2016 7:13 AM, Steven White wrote:
> But what about the current "default" list that comes with Solr? How was
> that list, for all supported languages, determined?
That list of stopwords was created from years of history with Lucene,
taking the expertise of many people and the wisdom of the
But what about the current "default" list that comes with Solr? How was
that list, for all supported languages, determined?
What I fear is this: when someone puts Solr into production, no one makes a
change to that list, so if the list is not "valid" this will impact
search, but if the list is
Hi Steven,
The list of stopwords for a language is not fixed; there is no single
universal list of stop words used by all natural language processing tools.
Ideally, stop words should be defined by search merchandisers based on their
domain instead of relying on the default list.
https://en.wikipedia.org/wiki/S