For starters, I think you need to elaborate your criteria for "queries that
can be collapsed". You can say "they're similar", but that raises two
questions: 1) how to measure similarity, and 2) what threshold level of
similarity to use for "ok to collapse".
Two measures of similarity to consider:
1. How many top results do they have in common?
2. How many top terms and phrases from their top results do they have in
common?
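As a rough illustration of the first measure, here is a minimal Python sketch
that scores two queries by the Jaccard overlap of their top-k result IDs. The
function name, the cutoff k, and the document IDs are all made up for the
example; in practice the ID lists would come from running each query against
your index.

```python
def top_result_overlap(results_a, results_b, k=10):
    """Jaccard overlap of the top-k document IDs returned for two queries."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 0.0  # no results for either query: treat as no evidence of similarity
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical ranked document IDs for three queries.
office = ["d1", "d2", "d3", "d4"]
the_office = ["d2", "d1", "d3", "d5"]
official = ["d9", "d8", "d1", "d7"]

print(top_result_overlap(office, the_office))  # 3 shared of 5 distinct -> 0.6
print(top_result_overlap(office, official))    # 1 shared of 7 distinct
```

The same function works for the second measure if you pass in lists of top
terms instead of document IDs. You still have to pick the threshold yourself.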
Maybe, ultimately, some arbitrary heuristic is good enough, say using
edit distance for the raw query text. Or some adjusted edit distance.
Or edit distance of the top terms of the top documents. Or, simply, ANY
simple heuristic that seems to both discriminate on differences and combine
on similarities.
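For the raw-query-text variant, a minimal Python sketch of plain Levenshtein
edit distance, plus one possible "adjusted" form that divides by the longer
query's length. The normalization is just an assumption for illustration, not
a recommendation, and the function names are mine.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance on lowercased text, scaled to 0..1 by the longer length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("office", "official"))            # 3
print(normalized_distance("Office", "Office DVD"))    # 4 edits over 10 chars -> 0.4
```

Note that "Office" is the same number of edits from "Official" as it is from
"Offices" plus a bit, so character-level distance alone says nothing about
whether two queries mean the same thing.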
Here's a test case query set:
1. Office
2. The Office
3. Official
4. Office release
5. Official release
6. Office DVD
There are three distinct groups there.
If you have a specific, narrow domain in mind, a thesaurus of concepts and
synonyms for that domain would help you a lot.
-- Jack Krupansky
-----Original Message-----
From: Otis Gospodnetic
Sent: Friday, July 19, 2013 12:33 PM
To: solr-user@lucene.apache.org
Subject: Collapsing similar queries
Hi,
Are there any known good tools or approaches to "collapsing queries"?
For example, imagine 4 original queries:
* big house
* big houses
* the big house
* bigger house
...and all 4 being reduced/collapsed to just "big house".
What might be some good approaches for doing this?
1) stem them all and collapse if they are identical
2) compute Levenshtein distance and collapse if they are close enough
Maybe also remove stop words from them first? (not so good for queries
consisting of all or lots of stop words, like "to be or not to be")
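A minimal Python sketch of option 1 combined with stop word removal, assuming
a tiny hand-picked stop list and a crude plural-stripping rule standing in for
a real stemmer (both are placeholders, not what any analyzer actually does).
If a query consists only of stop words, the sketch keeps them all rather than
collapsing to an empty key.

```python
from collections import defaultdict

# Illustrative stop list only; a real one would be larger.
STOP_WORDS = {"a", "an", "the", "to", "be", "or", "not", "of"}


def crude_stem(word):
    """Toy stand-in for a stemmer: strips a single trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def collapse_key(query):
    """Normalized form of a query; queries with equal keys get collapsed."""
    tokens = query.lower().split()
    content = [t for t in tokens if t not in STOP_WORDS]
    if not content:
        content = tokens  # all stop words: keep the query as typed
    return " ".join(crude_stem(t) for t in content)


def collapse(queries):
    """Group queries by their collapse key, preserving input order."""
    groups = defaultdict(list)
    for q in queries:
        groups[collapse_key(q)].append(q)
    return dict(groups)


groups = collapse(["big house", "big houses", "the big house",
                   "bigger house", "to be or not to be"])
for key, members in groups.items():
    print(key, "<-", members)
```

With this toy stemmer the first three queries share the key "big house", but
"bigger house" stays separate, since plural stripping does not touch
comparatives. Catching that case would take a different stemmer or the edit
distance check from option 2.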
Any better approaches?
Thanks,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm