Do not remove stop words. Want to search for “vitamin a”? That won’t work.
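As a rough illustration of the "vitamin a" problem, here is a minimal Python sketch. The stop list, the toy corpus, and the function names are all made up for this example, and the idf formula is one common smoothed variant, not the exact one any particular engine uses. It shows what stop word removal does to a query, and how a proportional weight keeps the common word while counting it for less.

```python
import math

# Hypothetical stop list and toy corpus, for illustration only.
STOP_WORDS = {"a", "and", "be", "not", "or", "the", "to"}

DOCS = [
    "vitamin a deficiency in children",
    "a short history of the common cold",
    "a guide to vitamin d",
    "to be or not to be",
    "the history of a small town",
]


def remove_stop_words(text):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def idf(term, docs):
    """Smoothed inverse document frequency: rarer terms score higher."""
    df = sum(1 for d in docs if term in d.lower().split())
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


# Brute-force removal changes the query, or empties it completely.
print(remove_stop_words("vitamin a"))           # ['vitamin']
print(remove_stop_words("to be or not to be"))  # []

# Proportional weighting keeps every term; common ones just weigh less.
for term in ("a", "vitamin", "deficiency"):
    print(term, round(idf(term, DOCS), 3))
```

With removal, "vitamin a" becomes a search for any vitamin. With weighting, "a" still takes part in matching and in phrase queries; it only contributes less to the score than "vitamin" does.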
Stop word removal is a hack left over from when we were running search engines in 64 kbytes of memory. Yes, common words are less important for search, but removing them is a brute force approach with severe side effects. Instead, we use a proportional approach with the tf.idf model. That puts a higher weight on rare words and a lower weight on common words.

For some real-life examples of problems with stop words, you can read the list of movie titles that disappear with stemming and stop words. I discovered these when I was running search at Netflix.

• Being There (this is the first one I noticed)
• To Be and To Have (Être et Avoir)
• To Have and Have Not
• Once and Again
• To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
• To Be or Not To Be (1983)
• Now and Then, Here and There
• Be with Me
• I’ll Be There
• It Had to Be You
• You Should Not Be Here
• You Are Here

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 29, 2016, at 5:39 PM, Steven White <swhite4...@gmail.com> wrote:
>
> Thanks Shawn. This is the best answer I have seen, much appreciated.
>
> A follow up question, I want to remove stop words from the list, but if I
> do, then search quality will degrade (and index size will grow (less of
> an issue)). For example, if I remove "a", then if someone searches for "For
> a Few Dollars More" (without quotes) chances are good records with "a" will
> land higher up that are not relevant to user's search. How can I address
> this? Can I set up my schema so that records that get hits against a list
> of words, let's say off the stop word list, are ranked lower?
>
> Steve
>
> On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
>>> I personally think that stopword removal is more of a problem than a
>>> solution.
>>
>> There actually is one thing that a stopword filter can do that has little
>> to do with the purpose it was designed for. You can make it impossible
>> to search for certain words.
>>
>> Imagine that your original data contains the word "frisbee" but for some
>> reason you do not want anybody to be able to locate results using that
>> word. You can create a stopword list containing just "frisbee" and any
>> other variations that you want to limit like "frisbees", then place it
>> as a filter on the index side of your analysis. With this in place,
>> searching for those terms will retrieve zero results.
>>
>> Thanks,
>> Shawn
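
The "frisbee" trick quoted above can be sketched with a toy inverted index in Python. This is not how Solr is implemented; the blocked list, the documents, and the function names are invented for the example. The only point is that a term filtered out at index time never gets a postings entry, so no query can match it.

```python
from collections import defaultdict

# Hypothetical blocked-term list, mirroring the "frisbee" example.
BLOCKED = frozenset({"frisbee", "frisbees"})

# Hypothetical documents, keyed by id.
DOCS = {
    1: "throwing a frisbee in the park",
    2: "a walk in the park",
}


def analyze(text, blocked=frozenset()):
    """Lowercase, split on whitespace, and drop blocked tokens."""
    return [t for t in text.lower().split() if t not in blocked]


def build_index(docs, blocked=frozenset()):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text, blocked):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term.

    The query side is deliberately left unfiltered here.
    """
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(DOCS, BLOCKED)
print(search(index, "frisbee"))  # set(): the term was never indexed
print(search(index, "park"))     # {1, 2}: other terms still work
```

Document 1 is still in the index and still matches "park"; only the blocked term is unreachable.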