I have technical data which I am querying with Lucene. One feature of the
content is that a large number of technical terms may be written either as
multiple words or as a single compound word: for example, ISOWEEK or
ISO WEEK, and SynonymFilter or synonym filter.

I have a synonym table which includes all of these phrases, thus isoweek=iso
week, isoyear=iso year, etc.

My understanding is that applying the synonyms at index time (with a
SynonymFilter in my analyzer) means I shouldn't need the synonym filter in
the query analyzer: every synonymous term is indexed whenever any one of
them appears, so a query containing any of the synonyms will match records
containing any of the others.

Checking with Luke, this appears to be the case. However, the queries are
not matching all the records I expect them to, so I am taking a deeper look.

In the indexing phase, input text is tokenised on whitespace and
punctuation, lowercased, and then processed by a synonym filter. The
relevant part of the analyzer is this:

   @Override
   protected TokenStreamComponents createComponents(String fieldName) {
      WhitespaceTokenizer src = new WhitespaceTokenizer();
      TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
      // Add synonyms at index time; the last argument is ignoreCase.
      result = new SynonymGraphFilter(result,
            getSynonyms(options.getSynonymsList()), true);
      // The index cannot store a token graph, so flatten it.
      result = new FlattenGraphFilter(result);
      return new TokenStreamComponents(src, result);
   }

The getSynonyms method builds a synonym map from a comma-delimited text file
and I know this is working because all the one-word synonym replacements
index and search perfectly. The problem I have is with synonym phrases.
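
For reference, here is a minimal sketch of the kind of parsing involved. It is
not the actual getSynonyms method: the class and method names are
hypothetical, it assumes one synonym group per line, and it uses only the
standard library. In the real method each entry would still have to be handed
to Lucene's SynonymMap.Builder, with multi-word entries joined by
SynonymMap.Builder.join first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only.
public class SynonymLines {

    /** Splits one comma-delimited line into trimmed, lowercased entries, skipping empty ones. */
    public static List<String> parseLine(String line) {
        List<String> entries = new ArrayList<>();
        for (String part : line.split(",")) {
            String entry = part.trim().toLowerCase(Locale.ROOT);
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Splits an entry on whitespace, so a phrase such as "iso week" yields two words. */
    public static String[] words(String entry) {
        return entry.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Show how each entry of a phrase-synonym line breaks down into words.
        for (String entry : parseLine("isoweek,iso week")) {
            System.out.println(entry + " -> " + String.join("|", words(entry)));
        }
    }
}
```

The point of the words() step is that a phrase entry is a sequence of tokens
rather than a single term, which is where the one-word and phrase cases
start to differ.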

So if the synonyms input file contains

  isoweek,isodate

then (using Luke) I can see that any document containing either 'isoweek' or
'isodate' has indexed both terms, and a search with either term returns
matching results for both. Great.

However if the input file contains

  isoweek,iso week

then (again using Luke) I can see that while any document containing
'isoweek' has indexed the terms 'isoweek', 'iso' and 'week', unfortunately
any document containing 'iso week' has only indexed 'iso' and 'week'.

Am I chasing the impossible here? Is there something I can do in the query
analyzer to make it work? (Currently the query analyzer is the same as the
indexing analyzer with the SynonymGraphFilter and FlattenGraphFilter
omitted.) Or do I have to manually pre-process every query to include OR
options for all phrase synonyms?
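
To make that last option concrete, this is roughly the kind of pre-processing
I have in mind. It is only a sketch under assumptions: QueryExpander and its
methods are hypothetical names, the synonym groups are assumed to be
lowercased with single spaces between words, and the query is assumed to be
plain whitespace-separated words with no query syntax of its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
public class QueryExpander {

    /** Replaces each known synonym (word or phrase) in the query with an OR clause over its group. */
    public static String expand(String query, List<List<String>> groups) {
        Map<String, List<String>> byPhrase = new HashMap<>();
        int maxWords = 1;
        for (List<String> group : groups) {
            for (String phrase : group) {
                byPhrase.put(phrase, group);
                maxWords = Math.max(maxWords, phrase.split(" ").length);
            }
        }
        String[] tokens = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest candidate phrase first, then shorter ones.
            for (int n = Math.min(maxWords, tokens.length - i); n >= 1 && matched == 0; n--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                List<String> group = byPhrase.get(candidate);
                if (group != null) {
                    out.add(orClause(group));
                    matched = n;
                }
            }
            if (matched == 0) {
                out.add(tokens[i]); // not a synonym: keep the word unchanged
                matched = 1;
            }
            i += matched;
        }
        return String.join(" ", out);
    }

    /** Builds "(a OR "b c")", quoting multi-word alternatives as phrases. */
    static String orClause(List<String> group) {
        List<String> alternatives = new ArrayList<>();
        for (String phrase : group) {
            alternatives.add(phrase.contains(" ") ? "\"" + phrase + "\"" : phrase);
        }
        return "(" + String.join(" OR ", alternatives) + ")";
    }

    public static void main(String[] args) {
        List<List<String>> groups = List.of(List.of("isoweek", "iso week"));
        System.out.println(expand("find ISO week values", groups));
    }
}
```

The expanded string would then go to the query parser in place of the
original query, which is the manual work I would prefer to avoid if the
analyzers can do it for me.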

I haven't produced a small test case for this because I'm hoping a
high-level discussion is all I need to put me on the right track.

cheers

T