I am using Lucene 8.6.3 in an application which searches a library of technical documentation. I have implemented synonym matching which works for single-word replacements, but does not match when one of the synonyms has two or more words. My attempts to support multi-term synonyms are failing. I'm sure one of the reasons is that I don't really know what I'm doing, but it hasn't helped that this seems to be an area where the Lucene implementation changes regularly, so a lot of the examples on the web are out of date.
I have a custom TechTokenFilter which splits input on whitespace into words, numbers, and individual characters like + and :. I have a custom analyzer which chains a WhitespaceTokenizer, a LowerCaseFilter, my TechTokenFilter, a SynonymGraphFilter and a FlattenGraphFilter. The full analyzer is used when indexing the documents, while the final two filters are dropped when querying the index.

The SynonymGraphFilter loads synonyms from a text file which contains multiple lines of the form:

    word,synonym[,synonym ...]

For example:

    analyze,analyse
    chart,graph,plot
    enquire,inquire

While I'm sure this code could be improved, the gist of it is as follows (I've left out try/catch, validation code and the like to save space):

/* ******************* CODE *********************** */

/* Analyzer and Synonym Map */
public class TechAnalyzer extends Analyzer {

    // opts is a class which says if we are indexing or querying,
    // and where the synonym list is
    public TechAnalyzer(Options opts) {
        this.options = opts;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldname) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
        // Synonyms are only expanded at index time
        if (options.indexing) {
            result = new SynonymGraphFilter(result, getSynonyms(options.synonymList), true);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }

    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = true;
        SynonymMap synMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        BufferedReader br = new BufferedReader(new FileReader(synlist));
        String line;
        while ((line = br.readLine()) != null) {
            processLine(builder, line);
        }
        br.close();
        synMap = builder.build();
        return synMap;
    }

    // Maps the first term on the line to each of the remaining terms
    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean includeOrig = true;
        String[] terms = line.split(",");
        String word = terms[0];
        String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
        for (String synonym : synonymsOfWord) {
            addSyn(builder, word, synonym, includeOrig);
        }
    }

    private static void addSyn(SynonymMap.Builder builder, String word, String synonym, boolean includeOrig) {
        // Multi-word synonyms are split on whitespace and joined into one CharsRef
        CharsRef syn = SynonymMap.Builder.join(synonym.split("\\s+"), new CharsRefBuilder());
        builder.add(new CharsRef(word), syn, includeOrig);
    }

    private Options options;
}

/* Building the index, opts.indexing = true */
Analyzer analyzer = new TechAnalyzer(opts);
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
// etc.

/* Searching the index, opts.indexing = false */
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new TechAnalyzer(opts);
QueryParser parser = new QueryParser("text", analyzer);
parser.setDefaultOperator(QueryParserBase.AND_OPERATOR);
String line = inp.readLine();
Query query = parser.parse(line);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
// etc.

/* ******************* END CODE *********************** */

While this works perfectly with simple word synonyms (such as those given as examples above), it fails to match documents when the synonym list includes phrases rather than words, although the code to construct the synonym map appears (to my eyes) to be creating it correctly. Thus if I add something like

    infantry,footsoldier,foot soldier

then:

- a search for "infantry" will match "footsoldier" but not "foot soldier";
- a search for "footsoldier" will match "infantry" but not "foot soldier";
- a search for "foot soldier" will match "foot soldier" but not "footsoldier" or "infantry".

I expect I have to do something more sophisticated than just using a QueryParser on a list of one or more words, but what and how?

Cheers
T
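
For reference, the line handling done by processLine and addSyn above can be checked without Lucene on the classpath. The following is only a minimal sketch that mirrors those two split steps using the standard library; the class name SynonymLineSketch and the methods entries and words are hypothetical and are not part of Lucene or of the analyzer. It shows that the entry "foot soldier" is broken into two words before it would be handed to SynonymMap.Builder.join. It makes no claim about what the analyzer or query parser does with the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it does not call Lucene.
final class SynonymLineSketch {

    /** Splits one line of the synonym file on commas, trimming each entry. */
    static List<String> entries(String line) {
        List<String> out = new ArrayList<>();
        for (String entry : line.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /** Splits one entry on whitespace, as addSyn does before joining the words. */
    static List<String> words(String entry) {
        return Arrays.asList(entry.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> line = entries("infantry,footsoldier,foot soldier");
        String head = line.get(0);
        // Print each mapping from the first term to the words of each synonym
        for (String synonym : line.subList(1, line.size())) {
            System.out.println(head + " -> " + words(synonym));
        }
    }
}
```

Running main prints one line per synonym of "infantry", with the second line listing "foot" and "soldier" as separate words.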