Re: Default stop word list

Walter Underwood Thu, 08 Sep 2016 09:22:06 -0700

I recommend that you remove StopFilterFactory from every analysis chain.

In the tf.idf scoring model, rare words are automatically weighted more than 
common words.


I have an index with 11.6 million documents. “the” occurs in 9.9 million of 
those documents. “cat” occurs in 16,000 of those documents. (I just did 
searches to get the counts).

This is the idf (inverse document frequency) formula for Solr:

public float idf(int docFreq, int numDocs) {
    // Math.log is the natural logarithm (base e).
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

“the” has an idf of about 1.16. “cat” has an idf of about 7.59.
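
As a quick check, here is a minimal sketch that plugs the rounded counts above into that formula. The class name IdfSketch and the idf10 helper are made up for illustration and are not part of Lucene or Solr. With Java's Math.log (natural logarithm) the formula gives roughly 1.16 and 7.59; a base-10 logarithm gives roughly 1.07 and 3.86 for the same counts. Either way, “cat” gets several times the weight of “the”.

```java
class IdfSketch {
    // Same formula as quoted above; Math.log is the natural logarithm.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Hypothetical base-10 variant, included only for comparison.
    static float idf10(int docFreq, int numDocs) {
        return (float) (Math.log10(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int numDocs = 11_600_000; // rounded counts from this message
        int dfThe = 9_900_000;
        int dfCat = 16_000;
        System.out.printf("the: ln %.2f, log10 %.2f%n",
                idf(dfThe, numDocs), idf10(dfThe, numDocs));
        System.out.printf("cat: ln %.2f, log10 %.2f%n",
                idf(dfCat, numDocs), idf10(dfCat, numDocs));
    }
}
```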

The term “the” still counts for relevance, but it is dominated by the weight 
for “cat”.
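
To make that concrete for the query the cat, here is a rough sketch under a deliberately simplified scoring assumption: each matching term contributes sqrt(tf) * idf^2, with length normalization, coord, and boosts left out. That is not the full Lucene formula. The class name, method names, and the two example documents are hypothetical.

```java
import java.util.Map;

class TfIdfSketch {
    static final int NUM_DOCS = 11_600_000;
    // Rounded document frequencies from this message; only these two terms are known.
    static final Map<String, Integer> DOC_FREQ = Map.of("the", 9_900_000, "cat", 16_000);

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Simplified score: sum of sqrt(tf) * idf^2 over query terms present in the document.
    static double score(String[] query, Map<String, Integer> termFreqs) {
        double total = 0.0;
        for (String term : query) {
            int tf = termFreqs.getOrDefault(term, 0);
            if (tf > 0) {
                double w = idf(DOC_FREQ.get(term), NUM_DOCS);
                total += Math.sqrt(tf) * w * w;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] query = {"the", "cat"};
        Map<String, Integer> onlyThe = Map.of("the", 10); // "the" ten times, no "cat"
        Map<String, Integer> onlyCat = Map.of("cat", 1);  // "cat" once, no "the"
        System.out.println("only 'the': " + score(query, onlyThe));
        System.out.println("only 'cat': " + score(query, onlyCat));
    }
}
```

Under these assumptions, a document with a single “cat” outscores one that repeats “the” ten times by more than a factor of ten.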

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 8, 2016, at 7:09 AM, Steven White <swhite4...@gmail.com> wrote:
> 
> Hi Walter and all.  Sorry for the late reply, I was out of town.
> 
> Are you saying the stop words should be removed from the stop word file?  I
> understand the issues I will run into because of the stop word list, but
> all along, my understanding has been that the point of the stop word file,
> which keeps those words from being indexed, is to improve relevancy
> ranking.  For example, if I index the word "the" instead of removing it,
> then when I send the search term "the cat" (without quotes), records
> with "the" will rank far higher than records with "cat" in my result set.
> In fact, records with "cat" may not even be on the first page.  Isn't this
> why the stop word list was created?
> 
> If my understanding is correct, is there a way for me to rank records lower
> when they hit only because of common words, such as stop words?  This
> way: (1) I can then get rid of the whole stop word list in the stop word
> file, (2) solve the issue of searching on "be with me", et al., and (3)
> prevent the ranking issue.
> 
> Steve
> 
> On Mon, Aug 29, 2016 at 9:18 PM, Walter Underwood <wun...@wunderwood.org>
> wrote:
> 
>> Do not remove stop words. Want to search for “vitamin a”? That won’t work.
>> 
>> Stop word removal is a hack left over from when we were running search
>> engines in 64 kbytes of memory.
>> 
>> Yes, common words are less important for search, but removing them is a
>> brute force approach with severe side effects. Instead, we use a
>> proportional approach with the tf.idf model. That puts a higher weight on
>> rare words and a lower weight on common words.
>> 
>> For some real-life examples of problems with stop words, you can read the
>> list of movie titles that disappear with stemming and stop words. I
>> discovered these when I was running search at Netflix.
>> 
>>        • Being There (this is the first one I noticed)
>>        • To Be and To Have (Être et Avoir)
>>        • To Have and To Have Not
>>        • Once and Again
>>        • To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
>>        • To Be or Not To Be (1983)
>>        • Now and Then, Here and There
>>        • Be with Me
>>        • I’ll Be There
>>        • It Had to Be You
>>        • You Should Not Be Here
>>        • You Are Here
>> 
>> https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 29, 2016, at 5:39 PM, Steven White <swhite4...@gmail.com> wrote:
>>> 
>>> Thanks Shawn.  This is the best answer I have seen, much appreciated.
>>> 
>>> A follow-up question: I want to remove stop words from the list, but if I
>>> do, then search quality will degrade (and index size will grow, which is
>>> less of an issue).  For example, if I remove "a", then if someone searches
>>> for "For a Few Dollars More" (without quotes), chances are good that
>>> records with "a" that are not relevant to the user's search will land
>>> higher up.  How can I address this?  Can I set up my schema so that
>>> records that get hits against a list of words, let's say off the stop
>>> word list, are ranked lower?
>>> 
>>> Steve
>>> 
>>> On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey <apa...@elyograg.org>
>>> wrote:
>>> 
>>>> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
>>>>> I personally think that stopword removal is more of a problem than a
>>>>> solution.
>>>> 
>>>> There actually is one thing that a stopword filter can do that has little
>>>> to do with the purpose it was designed for.  You can make it impossible
>>>> to search for certain words.
>>>> 
>>>> Imagine that your original data contains the word "frisbee" but for some
>>>> reason you do not want anybody to be able to locate results using that
>>>> word.  You can create a stopword list containing just "frisbee" and any
>>>> other variations that you want to limit like "frisbees", then place it
>>>> as a filter on the index side of your analysis.  With this in place,
>>>> searching for those terms will retrieve zero results.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>>> 
>> 
>>
