This technique was used at Infoseek in 1996, and is very effective.

It also gives a relevance improvement, because you have an estimate
of IDF for phrases (exact for two-word phrases). The terms "the" and
"who" will be very common, but "the who" is quite rare and will have
a big IDF.
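
A rough sketch of that IDF effect, in Python. The four documents are
made up for illustration, and log(N/df) is only one textbook form of
IDF (real engines use smoothed variants), so treat the numbers as a
toy, not as what any particular engine computes.

```python
import math
from collections import Counter

# Toy corpus, invented for this example.
DOCS = [
    "the who played the hall",
    "who is the man in the moon",
    "the man who sold the world",
    "the band played on",
]

def index_terms(doc):
    """Unique single terms plus unique adjacent-word pairs for one document."""
    toks = doc.lower().split()
    unigrams = set(toks)
    bigrams = {f"{a}_{b}" for a, b in zip(toks, toks[1:])}
    return unigrams | bigrams

def document_frequencies(docs):
    """Count how many documents contain each term or word pair."""
    df = Counter()
    for doc in docs:
        df.update(index_terms(doc))
    return df

def idf(term, df, n_docs):
    """Textbook IDF: log(N / df). Assumes the term occurs at least once."""
    return math.log(n_docs / df[term])

DF = document_frequencies(DOCS)
for term in ("the", "who", "the_who"):
    print(term, DF[term], round(idf(term, DF, len(DOCS)), 3))
```

Because the pair is indexed as its own term, its document frequency is
counted directly, which is why the IDF is exact for two-word phrases.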

wunder

On 11/24/08 10:31 AM, "Burton-West, Tom" <[EMAIL PROTECTED]> wrote:

> Hello all,
> 
> We are having problems with extremely slow phrase queries when the
> phrase query contains common words. We are reluctant to just use stop
> words due to various problems with false hits and some things becoming
> impossible to search with stop words turned on. (For example "to be or
> not to be", "the who", "man in the moon" vs "man on the moon" etc.)
> 
> The approach to this problem used by Nutch looks promising.  Has anyone
> ported the Nutch CommonGrams filter to Solr?
> 
> "Construct n-grams for frequently occurring terms and phrases while
> indexing. Optimize phrase queries to use the n-grams. Single terms are
> still indexed too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
> 
> 
> Tom
> 
> Tom Burton-West
> Information Retrieval Programmer
> Digital Library Production Services
> University of Michigan Library
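
The idea in the quoted Nutch description can be sketched in a few lines
of Python. This is a simplified illustration, not the Nutch code: the
COMMON word list and the function names index_tokens and
phrase_query_tokens are hypothetical, and the pairing rule (make a
bigram whenever either word of an adjacent pair is common) is an
assumption made for the example.

```python
# Hypothetical list of common words; a real list would come from index statistics.
COMMON = {"the", "who", "to", "be", "or", "not", "in", "on"}

def _is_paired(words, i, common):
    """True if words[i] and words[i+1] should be joined into a bigram."""
    return i + 1 < len(words) and (words[i] in common or words[i + 1] in common)

def index_tokens(text, common=COMMON):
    """Return (position, token) pairs: every single term is kept, and a
    bigram is overlaid at the same position where a common word is involved."""
    words = text.lower().split()
    out = []
    for i, w in enumerate(words):
        out.append((i, w))
        if _is_paired(words, i, common):
            out.append((i, f"{w}_{words[i + 1]}"))
    return out

def phrase_query_tokens(phrase, common=COMMON):
    """Rewrite a phrase query to use bigrams where they exist, keeping a
    single term only when no bigram covers it."""
    words = phrase.lower().split()
    out = []
    for i, w in enumerate(words):
        if _is_paired(words, i, common):
            out.append(f"{w}_{words[i + 1]}")
        elif not (i > 0 and _is_paired(words, i - 1, common)):
            out.append(w)
    return out

print(index_tokens("the who"))
print(phrase_query_tokens("man in the moon"))
print(phrase_query_tokens("man on the moon"))
```

With this rewrite the phrase query never has to walk the long postings
list for a bare "the", yet "man in the moon" and "man on the moon"
still produce different query terms.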
