Hi all,

I'm new to Lucene and have tried to write my own analyzer to support hyphenated words such as wi-fi, jean-pierre, etc. For our customer it is important to find the word

- wi-fi by wi, fi, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*
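To make the requirement concrete, here is a minimal plain-Java sketch of the terms I would expect to end up in the index for one hyphenated word. It does not use Lucene at all; the class name HyphenTermsSketch and the method expectedTerms are hypothetical names made up for this illustration, and it assumes that '-' is the only delimiter that matters.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class HyphenTermsSketch {

    // Hypothetical helper (not a Lucene API): lists the terms under which
    // a hyphenated word should be findable, in insertion order.
    static Set<String> expectedTerms(String word) {
        Set<String> terms = new LinkedHashSet<>();
        String lower = word.toLowerCase(Locale.ROOT);
        terms.add(lower);                        // the original form, e.g. wi-fi

        StringBuilder joined = new StringBuilder();
        for (String part : lower.split("-")) {   // assumption: '-' is the only delimiter
            if (!part.isEmpty()) {
                terms.add(part);                 // each word part, e.g. wi and fi
                joined.append(part);
            }
        }
        terms.add(joined.toString());            // the catenated form, e.g. wifi
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(expectedTerms("wi-fi"));       // prints [wi-fi, wi, fi, wifi]
        System.out.println(expectedTerms("jean-pierre")); // prints [jean-pierre, jean, pierre, jeanpierre]
    }
}
```

The catenated form jeanpierre is not something the customer asked for; it only appears because the sketch treats every hyphenated word the same way. A query such as jean-* is a prefix search against the original form and is not covered by this sketch.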
The analyzer:

    public class SupportHyphenatedWordsAnalyzer extends Analyzer {

        protected NormalizeCharMap charConvertMap;

        public SupportHyphenatedWordsAnalyzer() {
            initCharConvertMap();
        }

        // Character mapping applied before tokenizing: removes double quotes.
        protected void initCharConvertMap() {
            NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
            builder.add("\"", "");
            charConvertMap = builder.build();
        }

        @Override
        protected TokenStreamComponents createComponents(final String fieldName) {
            final Tokenizer src = new WhitespaceTokenizer();
            // Keep the original token, and also emit its word/number parts
            // and the catenated word parts.
            TokenStream tok = new WordDelimiterFilter(src,
                    WordDelimiterFilter.PRESERVE_ORIGINAL
                            | WordDelimiterFilter.GENERATE_WORD_PARTS
                            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                            | WordDelimiterFilter.CATENATE_WORDS,
                    null);
            tok = new LowerCaseFilter(tok);
            tok = new LengthFilter(tok, 1, 255);
            tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            return new TokenStreamComponents(src, tok);
        }

        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new MappingCharFilter(charConvertMap, reader);
        }
    }

The analyzer seems to work, except for exact phrase match queries. For example, the following words are indexed:

    FD-A320-REC-SIM-1
    FD-A320-REC-SIM-10
    FD-A320-REC-SIM-11
    MIA-FD-A320-REC-SIM-1
    SIN-FD-A320-REC-SIM-1

The (exact) query "FD-A320-REC-SIM-1" returns

    FD-A320-REC-SIM-1
    MIA-FD-A320-REC-SIM-1
    SIN-FD-A320-REC-SIM-1

For our customer this is wrong, because this exact phrase match query should return only the single entry FD-A320-REC-SIM-1.

Do you have any ideas or tips on how we should change our current analyzer to support this requirement?

Thanks and kind regards,
Diego