Do not remove stop words. Want to search for “vitamin a”? That won’t work.

Stop word removal is a hack left over from when we were running search engines 
in 64 kbytes of memory.

Yes, common words are less important for search, but removing them is a brute 
force approach with severe side effects. Instead, we use a proportional 
approach with the tf.idf model. That puts a higher weight on rare words and a 
lower weight on common words.
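
As a rough illustration of that weighting, here is a minimal Python sketch. The toy corpus, the function names, and the smoothed idf formula are all my assumptions for the example. This is one common textbook variant, not the exact formula any particular search engine uses.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for this example.
docs = [
    "vitamin a deficiency",
    "a guide to vitamin c",
    "a history of the frisbee",
    "the role of a vitamin",
]

def tokenize(text):
    # Naive whitespace tokenizer; note that no stop words are removed.
    return text.lower().split()

def idf(term, docs):
    # Smoothed inverse document frequency: rare terms get a higher weight,
    # common terms get a lower (but still non-zero) weight.
    n = len(docs)
    df = sum(1 for d in docs if term in tokenize(d))
    return math.log((n + 1) / (df + 1)) + 1.0

def score(query, doc, docs):
    # Sum of term frequency * idf over the query terms.
    tf = Counter(tokenize(doc))
    return sum(tf[t] * idf(t, docs) for t in tokenize(query))

for d in docs:
    print(round(score("vitamin a", d, docs), 3), d)
```

In this sketch "a" occurs in every document, so it gets the lowest weight, but it is still indexed and still matches. The query "vitamin a" keeps both of its terms.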

For some real-life examples of problems with stop words, you can read the list 
of movie titles that disappear with stemming and stop words. I discovered these 
when I was running search at Netflix.

        • Being There (this is the first one I noticed)
        • To Be and To Have (Être et Avoir)
        • To Have and To Have Not
        • Once and Again
        • To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
        • To Be or Not To Be (1983)
        • Now and Then, Here and There
        • Be with Me
        • I’ll Be There
        • It Had to Be You
        • You Should Not Be Here
        • You Are Here
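
To see how titles like these can vanish, here is a small Python sketch. The stop list and the one-entry stem table are hypothetical, picked only for illustration. They are not the lists Netflix or any real analyzer used.

```python
import re

# Hypothetical stop list, chosen for illustration only.
STOP_WORDS = {
    "a", "again", "and", "are", "be", "had", "have", "here", "i'll", "it",
    "me", "not", "now", "once", "or", "should", "then", "there", "to",
    "with", "you",
}

# Toy stand-in for a stemmer: maps an inflected form to its stem.
STEMS = {"being": "be"}

def analyze(title):
    # Lowercase, normalize curly apostrophes, split into word tokens,
    # stem each token, then drop every token on the stop list.
    text = title.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+", text)
    tokens = [STEMS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

titles = [
    "Being There",
    "To Be and To Have",
    "To Have and To Have Not",
    "Once and Again",
    "To Be or Not To Be",
    "Now and Then, Here and There",
    "Be with Me",
    "I'll Be There",
    "It Had to Be You",
    "You Should Not Be Here",
    "You Are Here",
]

for t in titles:
    print(repr(t), "->", analyze(t))
```

With this stop list, every title above analyzes to an empty token list, so there is nothing left to index or to query. A title such as "For a Few Dollars More" keeps some tokens but loses the "a".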

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 29, 2016, at 5:39 PM, Steven White <swhite4...@gmail.com> wrote:
> 
> Thanks Shawn.  This is the best answer I have seen, much appreciated.
> 
> A follow up question, I want to remove stop words from the list, but if I
> do, then search quality will degrade (and index size will grow (less of
> an issue)).  For example, if I remove "a", then if someone searches for "For
> a Few Dollars More" (without quotes) chances are good records with "a" will
> land higher up that are not relevant to user's search.  How can I address
> this?  Can I set up my schema so that records that get hits against a list
> of words, let's say off the stop word list, are ranked lower?
> 
> Steve
> 
> On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
>>> I personally think that stopword removal is more of a problem than a
>>> solution.
>> 
>> There actually is one thing that a stopword filter can do that has little
>> to do with the purpose it was designed for.  You can make it impossible
>> to search for certain words.
>> 
>> Imagine that your original data contains the word "frisbee" but for some
>> reason you do not want anybody to be able to locate results using that
>> word.  You can create a stopword list containing just "frisbee" and any
>> other variations that you want to limit like "frisbees", then place it
>> as a filter on the index side of your analysis.  With this in place,
>> searching for those terms will retrieve zero results.
>> 
>> Thanks,
>> Shawn
>> 
>> 
