Re: Analyzer for supporting hyphenated words

Diego Socaceti Wed, 22 Jul 2015 03:33:07 -0700

Sorry, a small refactoring typo in the code: curTokenProcessed should be
userCriteriaProcessed.


...

public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
public static final String MULTIPLE_CHARACTER_WILDCARD = "*";

...

  if (isExactCriteriaString(userCriteria)) {
    String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
        escape(userCriteria.substring(1, userCriteria.length() - 1)));
    userCriteriaProcessed = userCriteriaEscaped;
  } else {
    userCriteriaProcessed = escape(userCriteria);

    if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
      userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
    }
  }


  String queryStr = "";

  for (String fieldName : fields) {
    String escapedFieldName = escape(fieldName);
    queryStr += String.format("%s:%s ", escapedFieldName, userCriteriaProcessed);
  }

  query = new QueryParser("", analyzer).parse(queryStr.trim());

...
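
For reference, the fragment above can be exercised without Lucene on the
classpath. The following is only a sketch under assumptions: the class
QueryStringSketch, the helper escapeKeepingWildcards (a stand-in for the
escape method quoted further down, which delegates to QueryParser.escape)
and the body of isExactCriteriaString are made up for illustration and are
not the real code. It shows the query string that ends up being passed to
QueryParser.parse.

```java
// Sketch only: QueryStringSketch and its helpers are illustrative names,
// not part of Lucene or of the original code.
final class QueryStringSketch {

  static final String EXACT_SEARCH_FORMAT = "\"%s\"";
  static final String MULTIPLE_CHARACTER_WILDCARD = "*";

  // Assumed stand-in for the escape helper: backslash-escapes the classic
  // query syntax characters but leaves '*' alone so wildcards survive.
  static String escapeKeepingWildcards(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.trim().toCharArray()) {
      if ("\\+-!():^[]\"{}~?|&/".indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  // Assumption: "exact" means the criteria is wrapped in double quotes.
  static boolean isExactCriteriaString(String s) {
    return s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
  }

  // Builds "field:criteria" clauses for every field, separated by spaces.
  static String buildQueryString(String userCriteria, String... fields) {
    String userCriteriaProcessed;
    if (isExactCriteriaString(userCriteria)) {
      String inner = userCriteria.substring(1, userCriteria.length() - 1);
      userCriteriaProcessed =
          String.format(EXACT_SEARCH_FORMAT, escapeKeepingWildcards(inner));
    } else {
      userCriteriaProcessed = escapeKeepingWildcards(userCriteria);
      if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
        userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
      }
    }

    StringBuilder queryStr = new StringBuilder();
    for (String fieldName : fields) {
      queryStr.append(escapeKeepingWildcards(fieldName))
          .append(':')
          .append(userCriteriaProcessed)
          .append(' ');
    }
    return queryStr.toString().trim();
  }

  public static void main(String[] args) {
    // "title" and "code" are hypothetical field names.
    System.out.println(buildQueryString("\"FD-A320-REC-SIM-1\"", "title", "code"));
    System.out.println(buildQueryString("jean-", "title"));
  }
}
```

With the criteria "FD-A320-REC-SIM-1" and the hypothetical fields title and
code, the first call yields
title:"FD\-A320\-REC\-SIM\-1" code:"FD\-A320\-REC\-SIM\-1".
Because that string still goes through the QueryParser, the quoted part is
analyzed again, which is the tokenization problem discussed below.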

On Wed, Jul 22, 2015 at 12:27 PM, Diego Socaceti <[email protected]> wrote:

> Hi Alessandro,
>
> Sorry, I forgot the important part. Here it is:
>
> ...
>
> public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
>
> ...
>
>   if (isExactCriteriaString(userCriteria)) {
>     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
>         escape(userCriteria.substring(1, userCriteria.length() - 1)));
>     userCriteriaProcessed = userCriteriaEscaped;
>   } else {
>     userCriteriaProcessed = escape(userCriteria);
>
>     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
>       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
>     }
>   }
>
>
>   String queryStr = "";
>
>   for (String fieldName : fields) {
>     String escapedFieldName = escape(fieldName);
>     queryStr += String.format("%s:%s ", escapedFieldName,
> curTokenProcessed);
>   }
>
>   query = new QueryParser("", analyzer).parse(queryStr.trim());
>
> ...
>
>
> As far as I understand, my problem is that in my naive, query-syntax-based
> solution I have to use my analyzer, which means that the userCriteria is
> always tokenized.
>
> You suggest using the Java query classes to build the query, because then
> I can control whether the userCriteria is tokenized or not.
> Did I get you right?
>
>
> Thanks and Kind regards
>
> On Wed, Jul 22, 2015 at 11:44 AM, Alessandro Benedetti <
> [email protected]> wrote:
>
>> I read it briefly; correct me if I am wrong, but that code only parses the
>> content within the quotes ".
>> We are still at the String level.
>> I want to see how you build the PhraseQuery :)
>> Looking at the code, PhraseQuery allows you to add as many terms as you
>> want.
>>
>> What you need to do is not tokenise the content within the quotes and
>> actually create a TermQuery (in your case you are not even using the
>> position feature offered by the phrase query, you simply want to run a
>> TermQuery).
>>
>> So, to clarify: you should parse the content within the quotes (as you are
>> doing), then build a TermQuery out of that String, not tokenized at all.
>>
>> Does this make sense to you?
>> Can I see what you do after identifying the content within the quotes?
>>
>> Cheers
>>
>>
>> 2015-07-22 10:20 GMT+01:00 Diego Socaceti <[email protected]>:
>>
>> > Hi Alessandro,
>> >
>> > I guess code says more than words :)
>> >
>> > ...
>> >
>> > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
>> > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
>> >
>> > ...
>> >
>> >   if (isExactCriteriaString(userCriteria)) {
>> >     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
>> >         escape(userCriteria.substring(1, userCriteria.length() - 1)));
>> >     userCriteriaProcessed = userCriteriaEscaped;
>> >   } else {
>> >     userCriteriaProcessed = escape(userCriteria);
>> >
>> >     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
>> >       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
>> >     }
>> >   }
>> >
>> > ...
>> >
>> > public static String escape(String s) {
>> >   String result = s;
>> >
>> >   if (s != null && !s.trim().isEmpty()) {
>> >     String toEscape = s.trim();
>> >
>> >     if (toEscape.contains("*")) {
>> >       StringBuilder sb = new StringBuilder();
>> >
>> >       for (int i = 0; i < toEscape.length(); i++) {
>> >         char curChar = toEscape.charAt(i);
>> >         if (curChar == '*')
>> >           sb.append('*');
>> >         else
>> >           sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
>> >       }
>> >
>> >       result = sb.toString();
>> >     } else {
>> >       result = QueryParser.escape(toEscape);
>> >     }
>> >   }
>> >
>> >   return result;
>> > }
>> >
>> > ...
>> >
>> > Thanks and Kind regards
>> >
>> >
>> >
>> > On Wed, Jul 22, 2015 at 11:04 AM, Alessandro Benedetti <
>> > [email protected]> wrote:
>> >
>> > > As a start, Diego, how do you currently parse the user query to build
>> > > the Lucene queries?
>> > >
>> > > Cheers
>> > >
>> > > 2015-07-22 8:35 GMT+01:00 Diego Socaceti <[email protected]>:
>> > >
>> > > > Hi Alessandro,
>> > > >
>> > > > Yes, I want the user to be able to surround the query with "" to run
>> > > > the phrase query with a NOT tokenized phrase.
>> > > >
>> > > > What do I have to do?
>> > > >
>> > > > Thanks and Kind regards
>> > > >
>> > > > On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Hey Jack, reading the doc:
>> > > > >
>> > > > > "Set to true if phrase queries will be automatically generated when
>> > > > > the analyzer returns more than one term from whitespace delimited
>> > > > > text. NOTE: this behavior may not be suitable for all languages.
>> > > > >
>> > > > > Set to false if phrase queries should only be generated when
>> > > > > surrounded by double quotes."
>> > > > >
>> > > > >
>> > > > > In this user's case, I guess he's likely to use double quotes.
>> > > > >
>> > > > > The only problem he sees so far is that the phrase query uses the
>> > > > > query-time analyser to actually split the tokens.
>> > > > >
>> > > > > First we need feedback from him, but I guess he would like the
>> > > > > phrase query not to tokenise the text within the double quotes.
>> > > > >
>> > > > > In that case we should find a way.
>> > > > >
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <
>> [email protected]
>> > >:
>> > > > >
>> > > > > > If you don't explicitly enable automatic phrase queries, the
>> > > > > > Lucene query parser will assume an OR operator on the sub-terms
>> > > > > > when a white space-delimited term analyzes into a sequence of
>> > > > > > terms.
>> > > > > >
>> > > > > > See:
>> > > > > >
>> > > > > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
>> > > > > >
>> > > > > >
>> > > > > > -- Jack Krupansky
>> > > > > >
>> > > > > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <[email protected]> wrote:
>> > > > > >
>> > > > > > > Hi all,
>> > > > > > >
>> > > > > > > I'm new to Lucene and tried to write my own analyzer to support
>> > > > > > > hyphenated words like wi-fi, jean-pierre, etc.
>> > > > > > > For our customer it is important to find the word
>> > > > > > > - wi-fi by wi, fi, wifi, wi-fi
>> > > > > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > The analyzer:
>> > > > > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>> > > > > > >
>> > > > > > >   protected NormalizeCharMap charConvertMap;
>> > > > > > >
>> > > > > > >   public SupportHyphenatedWordsAnalyzer() {
>> > > > > > >     initCharConvertMap();
>> > > > > > >   }
>> > > > > > >
>> > > > > > >   // Maps double quotes to nothing, so they are stripped before tokenizing.
>> > > > > > >   protected void initCharConvertMap() {
>> > > > > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>> > > > > > >     builder.add("\"", "");
>> > > > > > >     charConvertMap = builder.build();
>> > > > > > >   }
>> > > > > > >
>> > > > > > >   @Override
>> > > > > > >   protected TokenStreamComponents createComponents(final String fieldName) {
>> > > > > > >
>> > > > > > >     final Tokenizer src = new WhitespaceTokenizer();
>> > > > > > >
>> > > > > > >     TokenStream tok = new WordDelimiterFilter(src,
>> > > > > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
>> > > > > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
>> > > > > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>> > > > > > >             | WordDelimiterFilter.CATENATE_WORDS,
>> > > > > > >         null);
>> > > > > > >     tok = new LowerCaseFilter(tok);
>> > > > > > >     tok = new LengthFilter(tok, 1, 255);
>> > > > > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>> > > > > > >
>> > > > > > >     return new TokenStreamComponents(src, tok);
>> > > > > > >   }
>> > > > > > >
>> > > > > > >   @Override
>> > > > > > >   protected Reader initReader(String fieldName, Reader reader) {
>> > > > > > >     return new MappingCharFilter(charConvertMap, reader);
>> > > > > > >   }
>> > > > > > > }
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > The analyzer seems to work except for exact phrase match queries.
>> > > > > > >
>> > > > > > > E.g. the following words are indexed:
>> > > > > > >
>> > > > > > > FD-A320-REC-SIM-1
>> > > > > > > FD-A320-REC-SIM-10
>> > > > > > > FD-A320-REC-SIM-11
>> > > > > > > MIA-FD-A320-REC-SIM-1
>> > > > > > > SIN-FD-A320-REC-SIM-1
>> > > > > > >
>> > > > > > >
>> > > > > > > The (exact) query "FD-A320-REC-SIM-1" returns
>> > > > > > > FD-A320-REC-SIM-1
>> > > > > > > MIA-FD-A320-REC-SIM-1
>> > > > > > > SIN-FD-A320-REC-SIM-1
>> > > > > > >
>> > > > > > > For our customer this is wrong, because this exact phrase match
>> > > > > > > query should only return the single entry FD-A320-REC-SIM-1.
>> > > > > > >
>> > > > > > > Do you have any ideas or tips on how we have to change our
>> > > > > > > current analyzer to support this requirement?
>> > > > > > >
>> > > > > > >
>> > > > > > > Thanks and Kind regards
>> > > > > > > Diego
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > --------------------------
>> > > > >
>> > > > > Benedetti Alessandro
>> > > > > Visiting card - http://about.me/alessandro_benedetti
>> > > > > Blog - http://alexbenedetti.blogspot.co.uk
>> > > > >
>> > > > > "Tyger, tyger burning bright
>> > > > > In the forests of the night,
>> > > > > What immortal hand or eye
>> > > > > Could frame thy fearful symmetry?"
>> > > > >
>> > > > > William Blake - Songs of Experience -1794 England
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > >
>> >
>>
>>
>>
>>
>
>
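
To illustrate the behaviour reported in the original question at the bottom
of the thread, here is a small Lucene-free model. It is a deliberate
simplification and an assumption, not what WordDelimiterFilter really does
internally: entries are lowercased and split on hyphens only, a phrase match
is modelled as the query parts appearing contiguously in the entry, and a
term match as comparing the whole untokenized value. HyphenMatchSketch and
its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Sketch only: a simplified model of the matching, not Lucene code.
final class HyphenMatchSketch {

  // Simplified stand-in for the analyzer: lowercase, then split on hyphens.
  static List<String> parts(String value) {
    return Arrays.asList(value.toLowerCase(Locale.ROOT).split("-"));
  }

  // Phrase-style match: the query parts occur contiguously and in order.
  static boolean phraseMatch(String entry, String query) {
    return Collections.indexOfSubList(parts(entry), parts(query)) >= 0;
  }

  // Term-style match: the whole lowercased value is compared, untokenized.
  static boolean termMatch(String entry, String query) {
    return entry.toLowerCase(Locale.ROOT).equals(query.toLowerCase(Locale.ROOT));
  }

  // Returns the entries that match, using either matching style.
  static List<String> search(List<String> entries, String query, boolean exactTerm) {
    List<String> hits = new ArrayList<>();
    for (String entry : entries) {
      boolean hit = exactTerm ? termMatch(entry, query) : phraseMatch(entry, query);
      if (hit) {
        hits.add(entry);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> indexed = Arrays.asList(
        "FD-A320-REC-SIM-1",
        "FD-A320-REC-SIM-10",
        "FD-A320-REC-SIM-11",
        "MIA-FD-A320-REC-SIM-1",
        "SIN-FD-A320-REC-SIM-1");

    System.out.println(search(indexed, "FD-A320-REC-SIM-1", false));
    System.out.println(search(indexed, "FD-A320-REC-SIM-1", true));
  }
}
```

Under this model the phrase-style match returns the same three entries as
reported (FD-A320-REC-SIM-1, MIA-FD-A320-REC-SIM-1, SIN-FD-A320-REC-SIM-1),
while the whole-value comparison returns only FD-A320-REC-SIM-1, which is the
effect Alessandro's TermQuery suggestion aims for.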
