Hey Jack,

Reading the doc:

"Set to true if phrase queries will be automatically generated when the analyzer returns more than one term from whitespace delimited text. NOTE: this behavior may not be suitable for all languages.
Set to false if phrase queries should only be generated when surrounded by double quotes."

In this user's case, I guess he's likely to use double quotes. The only problem he sees so far is that the phrase query still uses the query-time analyser to split the tokens. First we need feedback from him, but I guess he would like the phrase query not to tokenise the text within the double quotes. In that case we should find a way.
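Just to make the behaviour concrete, here is a rough sketch of what the flag changes in the classic QueryParser (the field name "code" and the StandardAnalyzer are only placeholders for illustration, not his actual setup):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;

public class AutoPhraseSketch {

  public static void main(String[] args) throws ParseException {
    // StandardAnalyzer splits the single whitespace-delimited token "wi-fi"
    // into the two terms "wi" and "fi" at query time.
    QueryParser parser = new QueryParser("code", new StandardAnalyzer());

    System.out.println(parser.parse("wi-fi"));
    // -> code:wi code:fi   (default: an OR over the sub-terms)

    parser.setAutoGeneratePhraseQueries(true);
    System.out.println(parser.parse("wi-fi"));
    // -> code:"wi fi"      (flag on: a PhraseQuery over the sub-terms)

    System.out.println(parser.parse("\"wi-fi\""));
    // -> code:"wi fi"      (explicit double quotes: the text inside is still analysed)
  }
}

As the last line shows, even an explicit double-quoted phrase is still run through the query-time analyser, which is exactly the part he may want to avoid.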
Cheers

2015-07-21 13:12 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:

> If you don't explicitly enable automatic phrase queries, the Lucene query
> parser will assume an OR operator on the sub-terms when a whitespace-delimited
> term analyzes into a sequence of terms.
>
> See:
> https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
>
> -- Jack Krupansky
>
> On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socac...@gmail.com> wrote:
>
> > Hi all,
> >
> > I'm new to Lucene and tried to write my own analyzer to support
> > hyphenated words like wi-fi, jean-pierre, etc.
> > For our customer it is important to find the word
> > - wi-fi by wi, fi, wifi, wi-fi
> > - jean-pierre by jean, pierre, jean-pierre, jean-*
> >
> > The analyzer:
> >
> > import java.io.Reader;
> >
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.Tokenizer;
> > import org.apache.lucene.analysis.charfilter.MappingCharFilter;
> > import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
> > import org.apache.lucene.analysis.core.LowerCaseFilter;
> > import org.apache.lucene.analysis.core.StopAnalyzer;
> > import org.apache.lucene.analysis.core.StopFilter;
> > import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> > import org.apache.lucene.analysis.miscellaneous.LengthFilter;
> > import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
> >
> > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> >
> >   protected NormalizeCharMap charConvertMap;
> >
> >   public SupportHyphenatedWordsAnalyzer() {
> >     initCharConvertMap();
> >   }
> >
> >   // Strip double quotes from the input before tokenization.
> >   protected void initCharConvertMap() {
> >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> >     builder.add("\"", "");
> >     charConvertMap = builder.build();
> >   }
> >
> >   @Override
> >   protected TokenStreamComponents createComponents(final String fieldName) {
> >     final Tokenizer src = new WhitespaceTokenizer();
> >
> >     // Split on intra-word delimiters (e.g. hyphens), keep the original token
> >     // and also index the catenated word parts.
> >     TokenStream tok = new WordDelimiterFilter(src,
> >         WordDelimiterFilter.PRESERVE_ORIGINAL
> >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> >             | WordDelimiterFilter.CATENATE_WORDS,
> >         null);
> >     tok = new LowerCaseFilter(tok);
> >     tok = new LengthFilter(tok, 1, 255);
> >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> >
> >     return new TokenStreamComponents(src, tok);
> >   }
> >
> >   @Override
> >   protected Reader initReader(String fieldName, Reader reader) {
> >     return new MappingCharFilter(charConvertMap, reader);
> >   }
> > }
> >
> > The analyzer seems to work except for exact phrase match queries.
> >
> > E.g. the following words are indexed:
> >
> > FD-A320-REC-SIM-1
> > FD-A320-REC-SIM-10
> > FD-A320-REC-SIM-11
> > MIA-FD-A320-REC-SIM-1
> > SIN-FD-A320-REC-SIM-1
> >
> > The (exact) query "FD-A320-REC-SIM-1" returns
> >
> > FD-A320-REC-SIM-1
> > MIA-FD-A320-REC-SIM-1
> > SIN-FD-A320-REC-SIM-1
> >
> > For our customer this is wrong, because this exact phrase match
> > query should only return the single entry FD-A320-REC-SIM-1.
> >
> > Do you have any ideas or tips on how we have to change our current
> > analyzer to support this requirement?
> >
> > Thanks and kind regards
> > Diego

--
--------------------------
Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England