Re: Analyzer for supporting hyphenated words

Diego Socaceti Wed, 22 Jul 2015 03:29:58 -0700

Hi Alessandro,

Sorry, I forgot the important part. Here it is:


...

public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
public static final String MULTIPLE_CHARACTER_WILDCARD = "*";

...

  if (isExactCriteriaString(userCriteria)) {
    String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
        escape(userCriteria.substring(1, userCriteria.length() - 1)));
    userCriteriaProcessed = userCriteriaEscaped;
  } else {
    userCriteriaProcessed = escape(userCriteria);

    if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
      userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
    }
  }


  String queryStr = "";

  for (String fieldName : fields) {
    String escapedFieldName = escape(fieldName);
    queryStr += String.format("%s:%s ", escapedFieldName,
        userCriteriaProcessed);
  }

  query = new QueryParser("", analyzer).parse(queryStr.trim());

...
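
For illustration, here is a rough plain-Java sketch of the string that ends up
in front of the parser in the exact (quoted) case. It does not call Lucene:
escapeStandIn(...) is a hypothetical stand-in for QueryParser.escape(...), and
the field names "name" and "code" are made up for the example.

```java
import java.util.List;
import java.util.StringJoiner;

/** Plain-Java illustration; nothing in here calls Lucene. */
final class QueryStringSketch {

  // Characters the classic query syntax treats as special. Hypothetical
  // stand-in for QueryParser.escape(...), which the real code uses.
  private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

  static String escapeStandIn(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      if (SPECIAL.indexOf(c) >= 0) {
        sb.append('\\'); // prefix every special character with a backslash
      }
      sb.append(c);
    }
    return sb.toString();
  }

  /** Joins one "field:criteria" clause per field with single spaces. */
  static String buildQueryString(List<String> fields, String processedCriteria) {
    StringJoiner joiner = new StringJoiner(" ");
    for (String field : fields) {
      joiner.add(escapeStandIn(field) + ":" + processedCriteria);
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    // Exact case: escape the inner text, then wrap it in double quotes again.
    String exact = "\"" + escapeStandIn("FD-A320-REC-SIM-1") + "\"";
    System.out.println(buildQueryString(List.of("name", "code"), exact));
  }
}
```

This prints name:"FD\-A320\-REC\-SIM\-1" code:"FD\-A320\-REC\-SIM\-1". The
quoted part is still only query syntax, so the parser hands it to the analyzer
like any other text.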


As far as I understand, my problem is that in my naive, query-syntax-based
solution I have to use my analyzer, which means that the userCriteria is
always tokenized.
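
To make the effect of that tokenization concrete, here is a deliberately
simplified plain-Java model (no Lucene involved). It assumes a value is
lowercased and split on hyphens into consecutive word parts, and that a phrase
matches when its parts appear consecutively in the indexed value. The real
WordDelimiterFilter also emits the original and catenated tokens, which this
sketch ignores.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Simplified model of phrase matching on hyphen-split tokens. */
final class PhraseMatchSketch {

  /** Assumed tokenization: lowercase, then split on hyphens. */
  static List<String> wordParts(String value) {
    return Arrays.asList(value.toLowerCase(Locale.ROOT).split("-"));
  }

  /** True if the phrase parts occur consecutively inside the indexed parts. */
  static boolean phraseMatches(String indexedValue, String phrase) {
    return Collections.indexOfSubList(wordParts(indexedValue), wordParts(phrase)) >= 0;
  }

  public static void main(String[] args) {
    String phrase = "FD-A320-REC-SIM-1";
    List<String> indexed = List.of(
        "FD-A320-REC-SIM-1",
        "FD-A320-REC-SIM-10",
        "FD-A320-REC-SIM-11",
        "MIA-FD-A320-REC-SIM-1",
        "SIN-FD-A320-REC-SIM-1");
    for (String value : indexed) {
      System.out.println(value + " -> " + phraseMatches(value, phrase));
    }
  }
}
```

Under this model the phrase matches FD-A320-REC-SIM-1, MIA-FD-A320-REC-SIM-1
and SIN-FD-A320-REC-SIM-1, but not the -10 and -11 entries, which is the same
result set reported earlier in the thread.
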

You suggest using the Java query classes to build the query, because then
I can control whether the userCriteria is tokenized or not.
Did I get you right?
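
A minimal sketch of that routing decision, in plain Java without Lucene on the
classpath. The Plan and Kind types are hypothetical, and the body of
isExactCriteriaString(...) is only an assumption about what the helper does.
In real code the EXACT_TERM branch would presumably become something like
new TermQuery(new Term(fieldName, text)), while the other branch keeps going
through the QueryParser and the analyzer.

```java
import java.util.Locale;

/** Decides whether user input bypasses the analyzer or not. */
final class CriteriaRoutingSketch {

  enum Kind { EXACT_TERM, PARSED_WITH_ANALYZER }

  /** Hypothetical value type: what to build, and from which text. */
  static final class Plan {
    final Kind kind;
    final String text;

    Plan(Kind kind, String text) {
      this.kind = kind;
      this.text = text;
    }
  }

  // Assumed behaviour of the helper from the thread: the input is wrapped
  // in a pair of double quotes.
  static boolean isExactCriteriaString(String s) {
    return s != null && s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
  }

  static Plan plan(String userCriteria) {
    String trimmed = userCriteria.trim();
    if (isExactCriteriaString(trimmed)) {
      // Strip the quotes. Lowercase by hand, because the analyzer (and with
      // it the LowerCaseFilter) is bypassed for this branch.
      String inner = trimmed.substring(1, trimmed.length() - 1);
      return new Plan(Kind.EXACT_TERM, inner.toLowerCase(Locale.ROOT));
    }
    return new Plan(Kind.PARSED_WITH_ANALYZER, trimmed);
  }

  public static void main(String[] args) {
    Plan exact = plan("\"FD-A320-REC-SIM-1\"");
    System.out.println(exact.kind + " " + exact.text);
    Plan loose = plan("jean-*");
    System.out.println(loose.kind + " " + loose.text);
  }
}
```

An untokenized term can only match if the index holds the whole value as a
single term. With PRESERVE_ORIGINAL set, the original token should be indexed
next to its parts, so this looks feasible for values without whitespace, but
that is an assumption worth verifying against the actual index.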


Thanks and Kind regards

On Wed, Jul 22, 2015 at 11:44 AM, Alessandro Benedetti <
[email protected]> wrote:

> I read it briefly; correct me if I am wrong, but that code parses the content
> within the quotes (").
> But we are still at the String level.
> I want to see how you build the PhraseQuery :)
> Looking at the code, PhraseQuery allows you to add as many terms as
> you want.
>
> What you need to do is not tokenise the content within the quotes and
> actually create a TermQuery (in your case you are not even using the
> position feature offered by the phrase query; you simply want to
> run a TermQuery).
>
> So, to clarify: you should parse the content within the quotes (as you are
> doing), then build a TermQuery out of that String, not tokenized at all.
>
> Does this make sense to you?
> Can I see what you do after identifying the content within the quotes?
>
> Cheers
>
>
> 2015-07-22 10:20 GMT+01:00 Diego Socaceti <[email protected]>:
>
> > Hi Alessandro,
> >
> > I guess code says more than words :)
> >
> > ...
> >
> > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
> >
> > ...
> >
> >   if (isExactCriteriaString(userCriteria)) {
> >     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
> >         escape(userCriteria.substring(1, userCriteria.length() - 1)));
> >     userCriteriaProcessed = userCriteriaEscaped;
> >   } else {
> >     userCriteriaProcessed = escape(userCriteria);
> >
> >     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
> >       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
> >     }
> >   }
> >
> > ...
> >
> > public static String escape(String s) {
> >   String result = s;
> >
> >   if (s != null && !s.trim().isEmpty()) {
> >     String toEscape = s.trim();
> >
> >     if (toEscape.contains("*")) {
> >       StringBuilder sb = new StringBuilder();
> >
> >       for (int i = 0; i < toEscape.length(); i++) {
> >         char curChar = toEscape.charAt(i);
> >         if (curChar == '*')
> >           sb.append('*');
> >         else
> >           sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
> >       }
> >
> >       result = sb.toString();
> >     } else {
> >       result = QueryParser.escape(toEscape);
> >     }
> >   }
> >
> >   return result;
> > }
> >
> > ...
> >
> > Thanks and Kind regards
> >
> >
> >
> > On Wed, Jul 22, 2015 at 11:04 AM, Alessandro Benedetti <
> > [email protected]> wrote:
> >
> > > As a start, Diego, how do you currently parse the user query to build
> > > the Lucene queries?
> > >
> > > Cheers
> > >
> > > 2015-07-22 8:35 GMT+01:00 Diego Socaceti <[email protected]>:
> > >
> > > > Hi Alessandro,
> > > >
> > > > Yes, I want the user to be able to surround the query with "" to run
> > > > the phrase query with a NOT tokenized phrase.
> > > >
> > > > What do I have to do?
> > > >
> > > > Thanks and Kind regards
> > > >
> > > > On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
> > > > [email protected]> wrote:
> > > >
> > > > > Hey Jack, reading the doc:
> > > > >
> > > > > "Set to true if phrase queries will be automatically generated when
> > > > > the analyzer returns more than one term from whitespace delimited
> > > > > text. NOTE: this behavior may not be suitable for all languages.
> > > > >
> > > > > Set to false if phrase queries should only be generated when
> > > > > surrounded by double quotes."
> > > > >
> > > > >
> > > > > In this user's case, I guess he's likely to use double quotes.
> > > > >
> > > > > The only problem he sees so far is that the phrase query uses the
> > > > > query-time analyser to actually split the tokens.
> > > > >
> > > > > First we need feedback from him, but I guess he would like the
> > > > > phrase query not to tokenise the text within the double quotes.
> > > > >
> > > > > In that case we should find a way.
> > > > >
> > > > >
> > > > > Cheers
> > > > >
> > > > > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <[email protected]>:
> > > > >
> > > > > > If you don't explicitly enable automatic phrase queries, the
> > > > > > Lucene query parser will assume an OR operator on the sub-terms
> > > > > > when a white space-delimited term analyzes into a sequence of terms.
> > > > > >
> > > > > > See:
> > > > > >
> > > > > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
> > > > > >
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I'm new to Lucene and tried to write my own analyzer to support
> > > > > > > hyphenated words like wi-fi, jean-pierre, etc.
> > > > > > > For our customer it is important to find the word
> > > > > > > - wi-fi by wi, fi, wifi, wi-fi
> > > > > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The analyzer:
> > > > > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> > > > > > >
> > > > > > >   protected NormalizeCharMap charConvertMap;
> > > > > > >
> > > > > > >   public SupportHyphenatedWordsAnalyzer() {
> > > > > > >     initCharConvertMap();
> > > > > > >   }
> > > > > > >
> > > > > > >   // Map double quotes to nothing, so they are stripped before tokenizing.
> > > > > > >   protected void initCharConvertMap() {
> > > > > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> > > > > > >     builder.add("\"", "");
> > > > > > >     charConvertMap = builder.build();
> > > > > > >   }
> > > > > > >
> > > > > > >   @Override
> > > > > > >   protected TokenStreamComponents createComponents(final String fieldName) {
> > > > > > >
> > > > > > >     final Tokenizer src = new WhitespaceTokenizer();
> > > > > > >
> > > > > > >     // Keep the original token and also emit its parts and the
> > > > > > >     // catenated word (wi-fi -> wi-fi, wi, fi, wifi).
> > > > > > >     TokenStream tok = new WordDelimiterFilter(src,
> > > > > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
> > > > > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> > > > > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> > > > > > >             | WordDelimiterFilter.CATENATE_WORDS,
> > > > > > >         null);
> > > > > > >     tok = new LowerCaseFilter(tok);
> > > > > > >     tok = new LengthFilter(tok, 1, 255);
> > > > > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> > > > > > >
> > > > > > >     return new TokenStreamComponents(src, tok);
> > > > > > >   }
> > > > > > >
> > > > > > >   @Override
> > > > > > >   protected Reader initReader(String fieldName, Reader reader) {
> > > > > > >     return new MappingCharFilter(charConvertMap, reader);
> > > > > > >   }
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The analyzer seems to work except for exact phrase match queries.
> > > > > > >
> > > > > > > e.g. the following words are indexed
> > > > > > >
> > > > > > > FD-A320-REC-SIM-1
> > > > > > > FD-A320-REC-SIM-10
> > > > > > > FD-A320-REC-SIM-11
> > > > > > > MIA-FD-A320-REC-SIM-1
> > > > > > > SIN-FD-A320-REC-SIM-1
> > > > > > >
> > > > > > >
> > > > > > > The (exact) query "FD-A320-REC-SIM-1" returns
> > > > > > > FD-A320-REC-SIM-1
> > > > > > > MIA-FD-A320-REC-SIM-1
> > > > > > > SIN-FD-A320-REC-SIM-1
> > > > > > >
> > > > > > > For our customer this is wrong, because this exact phrase match
> > > > > > > query should only return the single entry FD-A320-REC-SIM-1.
> > > > > > >
> > > > > > > Do you have any ideas or tips on how to change our current
> > > > > > > analyzer to support this requirement?
> > > > > > >
> > > > > > >
> > > > > > > Thanks and Kind regards
> > > > > > > Diego
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > --------------------------
> > > > >
> > > > > Benedetti Alessandro
> > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > >
> > > > > "Tyger, tyger burning bright
> > > > > In the forests of the night,
> > > > > What immortal hand or eye
> > > > > Could frame thy fearful symmetry?"
> > > > >
> > > > > William Blake - Songs of Experience -1794 England
> > > > >
> > > >
> > >
> >
>
