This technique was used at Infoseek in 1996, and it is very effective. It also improves relevance, because you get an estimate of IDF for phrases (exact for two-word phrases). The terms "the" and "who" are each very common, but the phrase "the who" is quite rare and will have a high IDF.
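A rough sketch of the idea in Python, to make the IDF point concrete. This is not the Nutch or Infoseek code. The common-word list, the "_" joiner, the rule of pairing whenever either word is common, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math

# Hypothetical common-word list; a real index would derive it from term statistics.
COMMON_WORDS = {"the", "who", "to", "be", "or", "not", "in", "on"}


def common_grams(tokens, common=COMMON_WORDS):
    """Emit every single term, plus a bigram for each adjacent pair
    in which at least one of the two words is common."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(tok + "_" + nxt)
    return out


def idf(term, indexed_docs):
    """Smoothed IDF (an assumed formula): log((1 + N) / (1 + df))."""
    df = sum(1 for terms in indexed_docs if term in terms)
    return math.log((1 + len(indexed_docs)) / (1 + df))


# Tiny made-up corpus, only to show the effect.
docs = [
    "the who played live".split(),
    "who is the man in the moon".split(),
    "the man on the moon".split(),
    "who wrote to be or not to be".split(),
]
indexed = [set(common_grams(d)) for d in docs]

for term in ("the", "who", "the_who"):
    print(term, round(idf(term, indexed), 3))
```

In this toy corpus "the" and "who" each occur in three of the four documents, while "the_who" occurs in only one, so the bigram gets the higher IDF. A phrase query for "the who" can then be answered by looking up the single rare term "the_who" instead of intersecting two very long posting lists.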
wunder

On 11/24/08 10:31 AM, "Burton-West, Tom" <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> We are having problems with extremely slow phrase queries when the
> phrase query contains common words. We are reluctant to just use stop
> words due to various problems with false hits and some things becoming
> impossible to search with stop words turned on. (For example "to be or
> not to be", "the who", "man in the moon" vs "man on the moon" etc.)
>
> The approach to this problem used by Nutch looks promising. Has anyone
> ported the Nutch CommonGrams filter to Solr?
>
> "Construct n-grams for frequently occurring terms and phrases while
> indexing. Optimize phrase queries to use the n-grams. Single terms are
> still indexed too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
>
>
> Tom
>
> Tom Burton-West
> Information Retrieval Programmer
> Digital Library Production Services
> University of Michigan Library