Re: SOLR Thesaurus

Chris Hostetter Fri, 10 Dec 2010 12:39:14 -0800

: My imaginative use case:
: - the user enters a term and maybe he turns on a flag to get not just
: the term, but all terms that are somehow related to it (usually the
: synonyms and narrower terms).
: - Solr first finds the queried term(s) in the thesaurus, then finds the
: related terms, modifies and issues the query
: e.g. query is fruits, and it becomes (fruit OR apple OR banana OR ...)
: 
: This use case is different from the synonym handler, which - as far as
: I know - modifies the index, and injects synonyms at the position of
: the original word. My use case supposes that we maintain the thesaurus
: as a different "database" (maybe another Solr index).


the use case you describe *could* be solved using the SynonymFilter -- you 
can configure it to be used at query time (for query expansion) *or* you 
can configure it to be used at index time (for reduction or expansion)

just express your thesaurus in the synonyms.txt format and configure it in 
your schema.xml
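
To make the synonyms.txt idea concrete, here is a rough sketch of the two common line forms (an explicit "a => b, c" mapping and a comma-separated group of equivalent terms), read by a toy parser. This is not Solr's own parser: the class name, method names, and sample terms are all made up for illustration, and the "each term expands to the whole group" behaviour assumes the filter is configured to expand.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny reader for two common synonyms.txt line forms.
// This is NOT Solr's parser; all names here are hypothetical.
class SynonymsFormatSketch {

    // Example thesaurus content; the terms are made up for illustration.
    static final String SAMPLE =
        "# explicit mapping: the left side is replaced by the right side\n" +
        "fruit => fruit, apple, banana\n" +
        "# equivalent terms: each one expands to the whole group\n" +
        "usa, united states\n";

    // Maps each source term to the list of terms it should expand to.
    static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
            int arrow = line.indexOf("=>");
            if (arrow >= 0) {
                List<String> targets = splitTerms(line.substring(arrow + 2));
                for (String source : splitTerms(line.substring(0, arrow))) {
                    out.put(source, targets);
                }
            } else {
                List<String> group = splitTerms(line);
                for (String term : group) out.put(term, group);
            }
        }
        return out;
    }

    // Splits a comma-separated list, trimming and lower-casing each entry.
    static List<String> splitTerms(String s) {
        List<String> terms = new ArrayList<>();
        for (String t : s.split(",")) {
            String term = t.trim().toLowerCase();
            if (!term.isEmpty()) terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        parse(SAMPLE).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Note that "united states" is a single multiword entry here, which is exactly where the gotchas start.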

The two gotchas to watch out for with this kind of approach are multiword 
synonyms and the way Lucene's QueryParser treats whitespace as a 
metacharacter.

in general, if you're going to do this kind of major query expansion, you 
probably want to use something like the "FieldQParser" which doesn't treat 
whitespace as special so user input like...
        United States
...makes it to the analyzer as one chunk of text, and can be looked up as 
is in your thesaurus.
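
A toy illustration of why the whitespace handling matters, assuming a hypothetical in-memory thesaurus (the map contents and all names are invented; nothing here is Lucene or Solr code). Splitting the input on whitespace before the lookup means the multiword entry is never seen, while passing the input through as one chunk finds it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: contrasts per-word lookup with whole-input lookup.
// The thesaurus content and all names are hypothetical.
class WhitespaceLookupSketch {

    static final Map<String, List<String>> THESAURUS = Map.of(
        "united states", List.of("united states", "usa", "america"));

    // Mimics a parser that splits on whitespace first, so each word is
    // looked up on its own and the multiword entry can never match.
    static List<String> lookupPerWord(String input) {
        List<String> out = new ArrayList<>();
        for (String word : input.trim().toLowerCase().split("\\s+")) {
            out.addAll(THESAURUS.getOrDefault(word, List.of(word)));
        }
        return out;
    }

    // Mimics a parser that hands the whole input over as one chunk.
    static List<String> lookupWhole(String input) {
        String key = input.trim().toLowerCase();
        return THESAURUS.getOrDefault(key, List.of(key));
    }

    public static void main(String[] args) {
        System.out.println(lookupPerWord("United States")); // no thesaurus hit
        System.out.println(lookupWhole("United States"));   // related terms found
    }
}
```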

The multiword synonym issue is more complicated - I don't have the energy 
to fully explain it right now, but for query time expansion it can be a 
real pain in the ass.  One workaround is to index shingle-esque terms 
instead of the individual words in your synonyms, but that defeats the 
point of your goal of having an external thesaurus that can be modified 
independently of the index.

My suggestion would be to write a simple little ThesaurusQParser that can 
use an instance of the SynonymFilter directly to preprocess the input 
text to get a list of all the Related Terms, and then delegate to another 
QParser to generate an appropriate Query for each of them (typically a 
PhraseQuery) which your ThesaurusQParser would then combine into a giant 
BooleanQuery (except you may want to consider a DisjunctionMaxQuery 
instead because of the scoring factors)

ThesaurusQParser would require very little code, because SynonymFilter 
would be doing all the hard work.
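
The overall shape of that idea, sketched as standalone Java under heavy assumptions: a hard-coded map stands in for the SynonymFilter lookup, and the output is a query string rather than real Query objects. An actual plugin would presumably extend Solr's QParserPlugin/QParser and build PhraseQuery clauses directly; the class name, map contents, and field name below are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative only: expands one user term into an OR of quoted phrases.
// A hard-coded map stands in for the SynonymFilter; names are hypothetical.
class ThesaurusExpansionSketch {

    static final Map<String, List<String>> RELATED = Map.of(
        "fruit", List.of("fruit", "apple", "banana"),
        "united states", List.of("united states", "usa"));

    // Looks up the whole input as one chunk, then emits one phrase clause
    // per related term, joined with OR.
    static String expand(String field, String userInput) {
        String key = userInput.trim().toLowerCase();
        List<String> terms = RELATED.getOrDefault(key, List.of(key));
        StringJoiner clauses = new StringJoiner(" OR ", "(", ")");
        for (String term : terms) {
            // Escape backslashes and quotes so the phrase stays well-formed.
            String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
            clauses.add(field + ":\"" + escaped + "\"");
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("text", "fruit"));
        System.out.println(expand("text", "United States"));
    }
}
```

Swapping the OR-join for a DisjunctionMaxQuery would change how the clauses are scored, not how the related terms are gathered.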


-Hoss
