Hi Diego,
let me try to help. I find this a little confusing:

"For our customer it is important to find the word
- *wi-fi* by wi, *fi*, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*"

But:

"The (exact) query "*FD-A320-REC-SIM-1*" returns
FD-A320-REC-SIM-1
MIA-*FD-A320-REC-SIM-1*
SIN-FD-A320-REC-SIM-1

for our customer this is wrong because this exact phrase match
query should only return the single entry FD-A320-REC-SIM-1"

Notice that the suffix "fi" in the first example is comparable to the
suffix "FD-A320-REC-SIM-1" in the second: in both cases the query matches
the trailing part of a longer hyphenated term.

To qualify your requirement: do you want the user to be able to surround
the query with "" to run a phrase query whose phrase is NOT tokenized?
By default, a phrase query is tokenized like any other query; the
difference is that term positions affect the matching.

If I have understood your requirement correctly, we can think about a
solution!

Cheers

2015-07-17 9:41 GMT+01:00 Diego Socaceti <socac...@gmail.com>:

> Hi all,
>
> i'm new to lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> For our customer it is important to find the word
> - wi-fi by wi, fi, wifi, wi-fi
> - jean-pierre by jean, pierre, jean-pierre, jean-*
>
>
> The analyzer:
> public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>
>   protected NormalizeCharMap charConvertMap;
>
>   public SupportHyphenatedWordsAnalyzer() {
>     initCharConvertMap();
>   }
>
>   protected void initCharConvertMap() {
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("\"", "");
>     charConvertMap = builder.build();
>   }
>
>   @Override
>   protected TokenStreamComponents createComponents(final String fieldName) {
>
>     final Tokenizer src = new WhitespaceTokenizer();
>
>     TokenStream tok = new WordDelimiterFilter(src,
>         WordDelimiterFilter.PRESERVE_ORIGINAL
>             | WordDelimiterFilter.GENERATE_WORD_PARTS
>             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>             | WordDelimiterFilter.CATENATE_WORDS,
>         null);
>     tok = new LowerCaseFilter(tok);
>     tok = new LengthFilter(tok, 1, 255);
>     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>
>     return new TokenStreamComponents(src, tok);
>   }
>
>   @Override
>   protected Reader initReader(String fieldName, Reader reader) {
>     return new MappingCharFilter(charConvertMap, reader);
>   }
> }
>
>
> The analyzer seems to work except for exact phrase match queries.
>
> e.g. the following words are indexed
>
> FD-A320-REC-SIM-1
> FD-A320-REC-SIM-10
> FD-A320-REC-SIM-11
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
>
> The (exact) query "FD-A320-REC-SIM-1" returns
> FD-A320-REC-SIM-1
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
> for our customer this is wrong because this exact phrase match
> query should only return the single entry FD-A320-REC-SIM-1
>
> Do you have any ideas or tips, how we have to change our current
> analyzer to support this requirement???
>
>
> Thanks and Kind regards
> Diego
>

-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
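
P.S. To make the point about term positions concrete, here is a minimal
plain-Java sketch. It is not Lucene code: the class PhraseMatchSketch and
its methods tokenize and phraseMatches are hypothetical names, and the
tokenizer is a simplified stand-in that only splits on '-' and lower-cases
(it ignores the preserved-original and catenated tokens that the real
filter chain also emits). Under those assumptions it shows why a tokenized
phrase query for FD-A320-REC-SIM-1 also matches MIA-FD-A320-REC-SIM-1: the
phrase tokens appear consecutively inside the longer term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; no Lucene classes are used.
class PhraseMatchSketch {

    // Simplified stand-in for the analysis chain (an assumption):
    // split on '-' and lower-case each part.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("-")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A tokenized phrase matches when its tokens occur consecutively,
    // in the same order, anywhere in the document's token list.
    static boolean phraseMatches(String phrase, String document) {
        return Collections.indexOfSubList(tokenize(document), tokenize(phrase)) >= 0;
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
                "FD-A320-REC-SIM-1",
                "FD-A320-REC-SIM-10",
                "FD-A320-REC-SIM-11",
                "MIA-FD-A320-REC-SIM-1",
                "SIN-FD-A320-REC-SIM-1");
        for (String doc : indexed) {
            System.out.println(doc + " -> " + phraseMatches("FD-A320-REC-SIM-1", doc));
        }
    }
}
```

In this sketch the three entries from your report come back as matches,
while the -10 and -11 entries do not, because "10" and "11" are different
tokens from "1". Returning only the single exact entry would need the
phrase compared as one untokenized term.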