For starters, I think you need to elaborate your criteria for "queries that can be collapsed". You can say "they're similar", but that raises two questions: 1) how to measure similarity, and 2) what threshold of similarity counts as "ok to collapse".

Two measures of similarity to consider:

1. How many top results do they have in common?
2. How many top terms and phrases from their top results do they have in common?
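
A minimal sketch of the first measure, assuming you can already fetch the top-N document IDs for each query. The IDs below are made up for illustration, and Jaccard overlap is just one reasonable way to turn "results in common" into a number:

```python
def jaccard(a, b):
    """Overlap of two collections: size of intersection / size of union."""
    a, b = set(a), set(b)
    if not a or not b:
        # Assumption: an empty result list gives no evidence of similarity.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical top-5 document IDs returned for two queries.
top_office = ["d1", "d2", "d3", "d4", "d5"]
top_the_office = ["d2", "d3", "d4", "d6", "d7"]

# 3 shared IDs out of 7 distinct IDs overall.
print(jaccard(top_office, top_the_office))
```

The same function would work for the second measure if you pass in the sets of top terms or phrases extracted from each query's top results instead of document IDs. The collapse threshold is still something you would have to pick and tune.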

Maybe, ultimately, some arbitrary heuristic is good enough, say edit distance on the raw query text. Or some adjusted edit distance. Or edit distance over the top terms of the top documents. Or simply any simple heuristic that seems to both discriminate on differences and combine on similarities.
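
To make the raw-text option concrete, here is a small sketch of Levenshtein edit distance, scaled by the length of the longer query so that it lands in the range 0 to 1. The lowercasing and the scaling are my assumptions, not a recommendation of specific settings:

```python
def levenshtein(s, t):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(q1, q2):
    """Edit distance divided by the longer query's length (0.0 = identical)."""
    q1, q2 = q1.lower().strip(), q2.lower().strip()
    longest = max(len(q1), len(q2))
    return levenshtein(q1, q2) / longest if longest else 0.0


print(normalized_distance("Office", "The Office"))
print(normalized_distance("Office", "Official"))
```

Under this particular scaling, "Office" comes out slightly closer to "Official" (3 edits over 8 characters) than to "The Office" (4 edits over 10 characters), which is one reason a plain edit distance may need adjusting before it is used as a collapse criterion.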

Here's a test case, as a query set:

1. Office
2. The Office
3. Official
4. Office release
5. Official release
6. Office DVD

There are three distinct groups there.

If you have a specific, narrow domain in mind, a thesaurus of concepts and synonyms for that domain would help you a lot.

-- Jack Krupansky
-----Original Message-----
From: Otis Gospodnetic
Sent: Friday, July 19, 2013 12:33 PM
To: solr-user@lucene.apache.org
Subject: Collapsing similar queries

Hi,

Are there any known good tools or approaches for "collapsing queries"?
For example, imagine 4 original queries:
* big house
* big houses
* the big house
* bigger house

...and all 4 being reduced/collapsed to just "big house".

What might be some good approaches for doing this?
1) stem them all and collapse if they are identical
2) compute Levenshtein distance and collapse if they are close enough

Maybe also remove stop words from them first? (not so good for queries
consisting of all or lots of stop words, like "to be or not to be")
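
A rough sketch of the stem-and-collapse idea with stop word removal, including a fallback for queries made up entirely of stop words. The stop list and `crude_stem` are toy stand-ins invented for this example; a real analyzer's stemmer may well leave a comparative like "bigger" untouched, so check what yours does:

```python
from collections import defaultdict

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "to", "be", "or", "not", "of"}


def crude_stem(word):
    """Toy stemmer: strips a plural "s" or a comparative "er"."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    elif word.endswith("er") and len(word) > 4:
        word = word[:-2]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]  # undo consonant doubling: "bigg" -> "big"
    return word


def collapse_key(query):
    """Lowercase, drop stop words, stem; the result is the grouping key."""
    tokens = query.lower().split()
    content = [t for t in tokens if t not in STOP_WORDS]
    # If every token is a stop word, keep them all so that a query such as
    # "to be or not to be" does not collapse to an empty key.
    if not content:
        content = tokens
    return " ".join(crude_stem(t) for t in content)


def collapse(queries):
    """Group queries that share the same collapse key."""
    groups = defaultdict(list)
    for q in queries:
        groups[collapse_key(q)].append(q)
    return dict(groups)


queries = ["big house", "big houses", "the big house", "bigger house"]
print(collapse(queries))
```

With these toy rules all four example queries share the key "big house", while "to be or not to be" keeps its full text as the key.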

Any better approaches?

Thanks,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
