Re: Collapsing similar queries

Jack Krupansky Fri, 19 Jul 2013 11:35:30 -0700

For starters, I think you need to elaborate your criteria for "queries that can be collapsed". You can say "they're similar", but then that raises the questions of: 1) how to measure similarity, and 2) what threshold level of similarity to use for "ok to collapse".


Two measures of similarity to consider:


1. How many top results do they have in common?

2. How many top terms and phrases from their top results do they have in common?
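
As a rough illustration of the first measure, here is a minimal Python sketch that scores two queries by the Jaccard overlap of their top-k result ids. The function names, the cutoff k, and the document ids are all made up for the example; how you fetch the result lists from your search engine is left out. The same set overlap could be applied to the top terms and phrases for the second measure.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_result_overlap(results_a, results_b, k=10):
    """Compare the first k document ids returned for two queries."""
    return jaccard(results_a[:k], results_b[:k])


# Hypothetical document ids, as if returned by two searches.
hits_office = ["d1", "d2", "d3", "d4"]
hits_the_office = ["d2", "d3", "d4", "d5"]

# 3 shared ids out of 5 distinct ids
print(top_result_overlap(hits_office, hits_the_office))
```

You would still have to pick the threshold above which two queries count as "ok to collapse".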

Maybe, ultimately, some arbitrary heuristic is good enough, say using edit distance on the raw query text. Or some adjusted edit distance. Or edit distance on the top terms of the top documents. Or, simply, ANY simple heuristic that seems to both discriminate on differences and combine on similarities.
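
A minimal sketch of the raw-text variant, assuming plain Levenshtein distance (insert, delete, substitute) scaled by the length of the longer query. The function names and the scaling are my own choices for illustration, not a recommendation.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(q1, q2):
    """Edit distance scaled to 0..1 by the longer query; case-insensitive."""
    q1, q2 = q1.lower().strip(), q2.lower().strip()
    longest = max(len(q1), len(q2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(q1, q2) / longest


print(levenshtein("office", "official"))  # 3
```

With this scaling, "Office" vs "Official" scores 0.625 while "Office" vs "The Office" scores 0.6, so raw edit distance alone rates the wrong pair as closer. That is the kind of case an adjusted distance would have to handle.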


Here's a test case, a query set:

1. Office
2. The Office
3. Official
4. Office release
5. Official release
6. Office DVD

There are three distinct groups there.

If you have a specific, narrow domain in mind, a thesaurus of concepts and synonyms for that domain would help you a lot.


-- Jack Krupansky

-----Original Message-----
From: Otis Gospodnetic

Sent: Friday, July 19, 2013 12:33 PM
To: solr-user@lucene.apache.org
Subject: Collapsing similar queries

Hi,

Are there any known good tools or approaches to "collapsing queries"?
For example, imagine 4 original queries:
* big house
* big houses
* the big house
* bigger house

...and all 4 being reduced/collapsed to just "big house".

What might be some good approaches for doing this?
1) stem them all and collapse if they are identical
2) compute Levenshtein distance and collapse if they are close enough

Maybe also remove stop words from them first? (not so good for queries
consisting of all or lots of stop words, like "to be or not to be")
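
A rough Python sketch of option 1 combined with stop word removal, assuming a tiny illustrative stop list and a toy suffix stripper standing in for a real stemmer (both are hypothetical and far too crude for production). Stop words are only dropped when something is left afterwards, which covers the "to be or not to be" case.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "or", "not", "be"}  # illustrative only


def toy_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    if word.endswith("er") and len(word) > 4:
        word = word[:-2]
        if len(word) > 2 and word[-1] == word[-2]:  # "bigg" -> "big"
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def collapse_key(query):
    """Lowercase, drop stop words unless nothing would remain, then stem."""
    tokens = query.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    if not kept:  # query made only of stop words: keep the original tokens
        kept = tokens
    return " ".join(toy_stem(t) for t in kept)


queries = ["big house", "big houses", "the big house", "bigger house"]
print({collapse_key(q) for q in queries})  # {'big house'}
```

Queries that share a key are collapsed. A real stemmer may not reduce "bigger" to "big", so that case might still need a distance check or a synonym list.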

Any better approaches?

Thanks,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/

Performance Monitoring -- http://sematext.com/spm
