Re: Analyzer for supporting hyphenated words

Diego Socaceti Thu, 23 Jul 2015 23:44:06 -0700

Hi Alessandro,

After talking to our customer: yes, a single userCriteria needs to be able
to mix classic and quoted queries.
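
To make the mixed case concrete, here is a minimal sketch of how one
userCriteria could be split into quoted and classic parts before any Lucene
query is built. It is plain Java with no Lucene calls, and it is not our
production code: the names CriteriaSplitter, Segment and split are
hypothetical, and treating an unbalanced quote as classic text is an
assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class CriteriaSplitter {

    /** One piece of the user input; quoted pieces are meant to stay untokenized. */
    public static final class Segment {
        public final String text;
        public final boolean quoted;

        public Segment(String text, boolean quoted) {
            this.text = text;
            this.quoted = quoted;
        }
    }

    /** Splits userCriteria into quoted and unquoted segments, in input order. */
    public static List<Segment> split(String userCriteria) {
        List<Segment> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < userCriteria.length(); i++) {
            char c = userCriteria.charAt(i);
            if (c == '"') {
                // A quote character ends whichever segment is being collected.
                flush(segments, current, inQuotes);
                inQuotes = !inQuotes;
            } else if (!inQuotes && Character.isWhitespace(c)) {
                // Outside quotes, whitespace separates classic terms.
                flush(segments, current, false);
            } else {
                current.append(c);
            }
        }
        // Assumption: text after an unbalanced opening quote is kept as one
        // classic (unquoted) segment.
        flush(segments, current, false);
        return segments;
    }

    private static void flush(List<Segment> out, StringBuilder buf, boolean quoted) {
        String text = buf.toString().trim();
        if (!text.isEmpty()) {
            out.add(new Segment(text, quoted));
        }
        buf.setLength(0);
    }
}
```

With such a split, each quoted segment could be turned into a TermQuery as
Alessandro suggested, while each unquoted segment would still go through the
analyzer via the QueryParser. How the resulting queries are combined is left
open here.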


Before we look into the details of the QueryParser: I'm currently using
org.apache.lucene.queryparser.classic.QueryParser from Lucene 5.2.1.
Is this the right QueryParser to use?


Thanks and Kind regards


On Wed, Jul 22, 2015 at 12:50 PM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> Yes, what I meant is that you can still use your analyser when the query
> is not in quotes.
> When it is in quotes, you can directly build a TermQuery out of it.
> Of course the scenario is not that simple. Do you think quoted and
> non-quoted query parts are two disjoint sets of queries, i.e. a user asks
> EITHER for a quoted query OR for a classic query?
> In that scenario it will be simple.
>
> In the case of a mix, we should take a closer look at the Lucene query
> parser code and see how the tokenization of content within quotes is
> handled.
>
> Cheers
>
> 2015-07-22 11:32 GMT+01:00 Diego Socaceti <socac...@gmail.com>:
>
> > Sorry, a small code refactoring typo: curTokenProcessed should be
> > userCriteriaProcessed
> >
> > ...
> >
> > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
> >
> > ...
> >
> >   if (isExactCriteriaString(userCriteria)) {
> >     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
> >         escape(userCriteria.substring(1, userCriteria.length() - 1)));
> >     userCriteriaProcessed = userCriteriaEscaped;
> >   } else {
> >     userCriteriaProcessed = escape(userCriteria);
> >
> >     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
> >       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
> >     }
> >   }
> >
> >
> >   String queryStr = "";
> >
> >   for (String fieldName : fields) {
> >     String escapedFieldName = escape(fieldName);
> >     queryStr += String.format("%s:%s ", escapedFieldName,
> >         userCriteriaProcessed);
> >   }
> >
> >   query = new QueryParser("", analyzer).parse(queryStr.trim());
> >
> > ...
> >
> > On Wed, Jul 22, 2015 at 12:27 PM, Diego Socaceti <socac...@gmail.com>
> > wrote:
> >
> > > Hi Alessandro,
> > >
> > > Sorry that I forgot the important part. Here it is:
> > >
> > > ...
> > >
> > > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> > > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
> > >
> > > ...
> > >
> > >   if (isExactCriteriaString(userCriteria)) {
> > >     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
> > >         escape(userCriteria.substring(1, userCriteria.length() - 1)));
> > >     userCriteriaProcessed = userCriteriaEscaped;
> > >   } else {
> > >     userCriteriaProcessed = escape(userCriteria);
> > >
> > >     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
> > >       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
> > >     }
> > >   }
> > >
> > >
> > >   String queryStr = "";
> > >
> > >   for (String fieldName : fields) {
> > >     String escapedFieldName = escape(fieldName);
> > >     queryStr += String.format("%s:%s ", escapedFieldName,
> > >         curTokenProcessed);
> > >   }
> > >
> > >   query = new QueryParser("", analyzer).parse(queryStr.trim());
> > >
> > > ...
> > >
> > >
> > > As far as I understand, my problem is that in my naive, query-syntax-based
> > > solution I have to use my analyzer, which means that the userCriteria is
> > > always tokenized.
> > >
> > > You suggest using the Java query classes to build the query, because
> > > then I can control whether the userCriteria is tokenized or not.
> > > Did I get you right?
> > >
> > >
> > > Thanks and Kind regards
> > >
> > > On Wed, Jul 22, 2015 at 11:44 AM, Alessandro Benedetti <
> > > benedetti.ale...@gmail.com> wrote:
> > >
> > >> I read it briefly. Correct me if I am wrong, but that code parses the
> > >> content within the quotes (").
> > >> But we are still at the String level.
> > >> I want to see how you build the PhraseQuery :)
> > >> Looking at the code, PhraseQuery allows you to add as many terms as
> > >> you want.
> > >>
> > >> What you need to do is to not tokenise the content within the quotes
> > >> and actually create a TermQuery (in your case you are not even using
> > >> the position feature offered by the phrase query; you simply want to
> > >> run a TermQuery).
> > >>
> > >> So to clarify: you should parse the content within the quotes (as you
> > >> are doing), then build a TermQuery out of that String, not tokenized
> > >> at all.
> > >>
> > >> Does this make sense to you?
> > >> Can I see what you do after identifying the content within the quotes?
> > >>
> > >> Cheers
> > >>
> > >>
> > >> 2015-07-22 10:20 GMT+01:00 Diego Socaceti <socac...@gmail.com>:
> > >>
> > >> > Hi Alessandro,
> > >> >
> > >> > I guess code says more than words :)
> > >> >
> > >> > ...
> > >> >
> > >> > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> > >> > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
> > >> >
> > >> > ...
> > >> >
> > >> >   if (isExactCriteriaString(userCriteria)) {
> > >> >     String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
> > >> >         escape(userCriteria.substring(1, userCriteria.length() - 1)));
> > >> >     userCriteriaProcessed = userCriteriaEscaped;
> > >> >   } else {
> > >> >     userCriteriaProcessed = escape(userCriteria);
> > >> >
> > >> >     if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
> > >> >       userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
> > >> >     }
> > >> >   }
> > >> >
> > >> > ...
> > >> >
> > >> > public static String escape(String s) {
> > >> >   String result = s;
> > >> >
> > >> >   if (s != null && !s.trim().isEmpty()) {
> > >> >     String toEscape = s.trim();
> > >> >
> > >> >     if (toEscape.contains("*")) {
> > >> >       StringBuilder sb = new StringBuilder();
> > >> >
> > >> >       for (int i = 0; i < toEscape.length(); i++) {
> > >> >         char curChar = toEscape.charAt(i);
> > >> >         if (curChar == '*')
> > >> >           sb.append('*');
> > >> >         else
> > >> >           sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
> > >> >       }
> > >> >
> > >> >       result = sb.toString();
> > >> >     } else {
> > >> >       result = QueryParser.escape(toEscape);
> > >> >     }
> > >> >   }
> > >> >
> > >> >   return result;
> > >> > }
> > >> >
> > >> > ...
> > >> >
> > >> > Thanks and Kind regards
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Jul 22, 2015 at 11:04 AM, Alessandro Benedetti <
> > >> > benedetti.ale...@gmail.com> wrote:
> > >> >
> > >> > > As a start, Diego, how do you currently parse the user query to
> > >> > > build the Lucene queries?
> > >> > >
> > >> > > Cheers
> > >> > >
> > >> > > 2015-07-22 8:35 GMT+01:00 Diego Socaceti <socac...@gmail.com>:
> > >> > >
> > >> > > > Hi Alessandro,
> > >> > > >
> > >> > > > Yes, I want the user to be able to surround the query with "" to
> > >> > > > run the phrase query with a NOT tokenized phrase.
> > >> > > >
> > >> > > > What do I have to do?
> > >> > > >
> > >> > > > Thanks and Kind regards
> > >> > > >
> > >> > > > On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
> > >> > > > benedetti.ale...@gmail.com> wrote:
> > >> > > >
> > >> > > > > Hey Jack, reading the doc:
> > >> > > > >
> > >> > > > > "Set to true if phrase queries will be automatically generated
> > >> > > > > when the analyzer returns more than one term from whitespace
> > >> > > > > delimited text. NOTE: this behavior may not be suitable for all
> > >> > > > > languages.
> > >> > > > >
> > >> > > > > Set to false if phrase queries should only be generated when
> > >> > > > > surrounded by double quotes."
> > >> > > > >
> > >> > > > >
> > >> > > > > In the user's case, I guess he is likely to use double quotes.
> > >> > > > >
> > >> > > > > The only problem he sees so far is that the phrase query uses
> > >> > > > > the query-time analyser to split the text into tokens.
> > >> > > > >
> > >> > > > > First we need feedback from him, but I guess he would like the
> > >> > > > > phrase query not to tokenise the text within the double quotes.
> > >> > > > >
> > >> > > > > In that case we should find a way.
> > >> > > > >
> > >> > > > >
> > >> > > > > Cheers
> > >> > > > >
> > >> > > > > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
> > >> > > > >
> > >> > > > > > If you don't explicitly enable automatic phrase queries, the
> > >> > > > > > Lucene query parser will assume an OR operator on the sub-terms
> > >> > > > > > when a whitespace-delimited term analyzes into a sequence of
> > >> > > > > > terms.
> > >> > > > > >
> > >> > > > > > See:
> > >> > > > > >
> > >> > > > > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > -- Jack Krupansky
> > >> > > > > >
> > >> > > > > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <socac...@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Hi all,
> > >> > > > > > >
> > >> > > > > > > I'm new to Lucene and tried to write my own analyzer to
> > >> > > > > > > support hyphenated words like wi-fi, jean-pierre, etc.
> > >> > > > > > > For our customer it is important to find the word
> > >> > > > > > > - wi-fi by wi, fi, wifi, wi-fi
> > >> > > > > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > The analyzer:
> > >> > > > > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
> > >> > > > > > >
> > >> > > > > > >   protected NormalizeCharMap charConvertMap;
> > >> > > > > > >
> > >> > > > > > >   public SupportHyphenatedWordsAnalyzer() {
> > >> > > > > > >     initCharConvertMap();
> > >> > > > > > >   }
> > >> > > > > > >
> > >> > > > > > >   protected void initCharConvertMap() {
> > >> > > > > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
> > >> > > > > > >     builder.add("\"", "");
> > >> > > > > > >     charConvertMap = builder.build();
> > >> > > > > > >   }
> > >> > > > > > >
> > >> > > > > > >   @Override
> > >> > > > > > >   protected TokenStreamComponents createComponents(final String fieldName) {
> > >> > > > > > >
> > >> > > > > > >     final Tokenizer src = new WhitespaceTokenizer();
> > >> > > > > > >
> > >> > > > > > >     TokenStream tok = new WordDelimiterFilter(src,
> > >> > > > > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
> > >> > > > > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
> > >> > > > > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
> > >> > > > > > >             | WordDelimiterFilter.CATENATE_WORDS,
> > >> > > > > > >         null);
> > >> > > > > > >     tok = new LowerCaseFilter(tok);
> > >> > > > > > >     tok = new LengthFilter(tok, 1, 255);
> > >> > > > > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
> > >> > > > > > >
> > >> > > > > > >     return new TokenStreamComponents(src, tok);
> > >> > > > > > >   }
> > >> > > > > > >
> > >> > > > > > >   @Override
> > >> > > > > > >   protected Reader initReader(String fieldName, Reader reader) {
> > >> > > > > > >     return new MappingCharFilter(charConvertMap, reader);
> > >> > > > > > >   }
> > >> > > > > > > }
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > The analyzer seems to work except for exact phrase match queries.
> > >> > > > > > >
> > >> > > > > > > e.g. the following words are indexed
> > >> > > > > > >
> > >> > > > > > > FD-A320-REC-SIM-1
> > >> > > > > > > FD-A320-REC-SIM-10
> > >> > > > > > > FD-A320-REC-SIM-11
> > >> > > > > > > MIA-FD-A320-REC-SIM-1
> > >> > > > > > > SIN-FD-A320-REC-SIM-1
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > The (exact) query "FD-A320-REC-SIM-1" returns
> > >> > > > > > > FD-A320-REC-SIM-1
> > >> > > > > > > MIA-FD-A320-REC-SIM-1
> > >> > > > > > > SIN-FD-A320-REC-SIM-1
> > >> > > > > > >
> > >> > > > > > > For our customer this is wrong, because this exact phrase match
> > >> > > > > > > query should only return the single entry FD-A320-REC-SIM-1.
> > >> > > > > > >
> > >> > > > > > > Do you have any ideas or tips on how we have to change our
> > >> > > > > > > current analyzer to support this requirement?
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > Thanks and Kind regards
> > >> > > > > > > Diego
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > --------------------------
> > >> > > > >
> > >> > > > > Benedetti Alessandro
> > >> > > > > Visiting card - http://about.me/alessandro_benedetti
> > >> > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > >> > > > >
> > >> > > > > "Tyger, tyger burning bright
> > >> > > > > In the forests of the night,
> > >> > > > > What immortal hand or eye
> > >> > > > > Could frame thy fearful symmetry?"
> > >> > > > >
> > >> > > > > William Blake - Songs of Experience -1794 England
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> >
>
>
>
>
