Sorry, a small refactoring typo in the code: curTokenProcessed should be
userCriteriaProcessed.
...
public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
...
if (isExactCriteriaString(userCriteria)) {
  String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
      escape(userCriteria.substring(1, userCriteria.length() - 1)));
  userCriteriaProcessed = userCriteriaEscaped;
} else {
  userCriteriaProcessed = escape(userCriteria);
  if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
    userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
  }
}
String queryStr = "";
for (String fieldName : fields) {
  String escapedFieldName = escape(fieldName);
  queryStr += String.format("%s:%s ", escapedFieldName,
      userCriteriaProcessed);
}
query = new QueryParser("", analyzer).parse(queryStr.trim());
...
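
A minimal sketch of the direction Alessandro suggests below, written without
Lucene so it runs on its own. The class and method names are hypothetical, and
isExactCriteriaString is only a stand-in for the real helper, which is not
shown in this thread. It assumes the index holds the lowercased whole word
(LowerCaseFilter plus PRESERVE_ORIGINAL in the analyzer quoted at the bottom)
and that the quoted criteria contains no whitespace. The returned text would
then go into one new TermQuery(new Term(fieldName, text)) per field instead of
through the QueryParser, so the analyzer never sees it.

```java
import java.util.Locale;

public class CriteriaRouting {

    static final String MULTIPLE_CHARACTER_WILDCARD = "*";

    // Hypothetical stand-in: criteria counts as "exact" when it is wrapped
    // in double quotes. The real isExactCriteriaString is not shown.
    static boolean isExactCriteriaString(String s) {
        return s != null && s.length() >= 2
                && s.startsWith("\"") && s.endsWith("\"");
    }

    // Text for a single un-analyzed term, or null when the criteria is not
    // exact. Lowercased by hand because the analyzer is bypassed (assumption:
    // the index stores the lowercased original token).
    static String exactTermText(String userCriteria) {
        if (!isExactCriteriaString(userCriteria)) {
            return null;
        }
        String inner = userCriteria.substring(1, userCriteria.length() - 1);
        return inner.toLowerCase(Locale.ROOT);
    }

    // Non-exact criteria keep the existing behaviour: append a trailing
    // wildcard unless one is already there (escaping omitted here).
    static String wildcardCriteria(String userCriteria) {
        if (userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
            return userCriteria;
        }
        return userCriteria + MULTIPLE_CHARACTER_WILDCARD;
    }

    public static void main(String[] args) {
        String[] samples = { "\"FD-A320-REC-SIM-1\"", "jean-", "wi-fi" };
        for (String criteria : samples) {
            String exact = exactTermText(criteria);
            if (exact != null) {
                // Would become a TermQuery per field, not a parsed query.
                System.out.println("term query text: " + exact);
            } else {
                // Would still go through escape() and the QueryParser.
                System.out.println("parsed query text: " + wildcardCriteria(criteria));
            }
        }
    }
}
```

Only the exact branch changes; wildcard criteria would still be escaped and
parsed as in the snippet above.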
On Wed, Jul 22, 2015 at 12:27 PM, Diego Socaceti <[email protected]> wrote:
> Hi Alessandro,
>
> sorry, I forgot the important part. Here it is:
>
> ...
>
> public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
> public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
>
> ...
>
> if (isExactCriteriaString(userCriteria)) {
> String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
> escape(userCriteria.substring(1, userCriteria.length() - 1)));
> userCriteriaProcessed = userCriteriaEscaped;
> } else {
> userCriteriaProcessed = escape(userCriteria);
>
> if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
> userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
> }
> }
>
>
> String queryStr = "";
>
> for (String fieldName : fields) {
> String escapedFieldName = escape(fieldName);
> queryStr += String.format("%s:%s ", escapedFieldName,
> curTokenProcessed);
> }
>
> query = new QueryParser("", analyzer).parse(queryStr.trim());
>
> ...
>
>
> As far as I understand, my problem is that in my naive, query-syntax-based
> solution I have to use my analyzer, which means that the userCriteria is
> always tokenized.
>
> You suggest using the Java query classes to build the query, because then
> I can control whether the userCriteria is tokenized or not.
> Did I get you right?
>
>
> Thanks and Kind regards
>
> On Wed, Jul 22, 2015 at 11:44 AM, Alessandro Benedetti <
> [email protected]> wrote:
>
>> I read it briefly, correct me if I am wrong, but that code only parses the
>> content within the quotes (").
>> We are still at the String level.
>> I want to see how you build the PhraseQuery :)
>> Looking at the code, PhraseQuery allows you to add as many terms as you
>> want.
>>
>> What you need to do is not tokenise the content within the quotes and
>> actually create a TermQuery (in your case you are not even using the
>> position features offered by the phrase query, you simply want to run a
>> TermQuery).
>>
>> So, to clarify: you should parse the content within the quotes (as you are
>> doing), then build a TermQuery out of that String, not tokenised at all.
>>
>> Does this make sense to you?
>> Can I see what you do after identifying the content within the quotes?
>>
>> Cheers
>>
>>
>> 2015-07-22 10:20 GMT+01:00 Diego Socaceti <[email protected]>:
>>
>> > Hi Alessandro,
>> >
>> > I guess code says more than words :)
>> >
>> > ...
>> >
>> > public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
>> > public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
>> >
>> > ...
>> >
>> > if (isExactCriteriaString(userCriteria)) {
>> > String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
>> > escape(userCriteria.substring(1, userCriteria.length() - 1)));
>> > userCriteriaProcessed = userCriteriaEscaped;
>> > } else {
>> > userCriteriaProcessed = escape(userCriteria);
>> >
>> > if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
>> > userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
>> > }
>> > }
>> >
>> > ...
>> >
>> > public static String escape(String s) {
>> >   String result = s;
>> >
>> >   if (s != null && !s.trim().isEmpty()) {
>> >     String toEscape = s.trim();
>> >
>> >     if (toEscape.contains("*")) {
>> >       StringBuilder sb = new StringBuilder();
>> >
>> >       for (int i = 0; i < toEscape.length(); i++) {
>> >         char curChar = toEscape.charAt(i);
>> >         if (curChar == '*')
>> >           sb.append('*');
>> >         else
>> >           sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
>> >       }
>> >
>> >       result = sb.toString();
>> >     } else {
>> >       result = QueryParser.escape(toEscape);
>> >     }
>> >   }
>> >
>> >   return result;
>> > }
>> >
>> > ...
>> >
>> > Thanks and Kind regards
>> >
>> >
>> >
>> > On Wed, Jul 22, 2015 at 11:04 AM, Alessandro Benedetti <
>> > [email protected]> wrote:
>> >
>> > > As a start, Diego, how do you currently parse the user query to build
>> > > the Lucene queries?
>> > >
>> > > Cheers
>> > >
>> > > 2015-07-22 8:35 GMT+01:00 Diego Socaceti <[email protected]>:
>> > >
>> > > > Hi Alessandro,
>> > > >
>> > > > yes, I want the user to be able to surround the query with "" to run
>> > > > the phrase query with a NOT tokenized phrase.
>> > > >
>> > > > What do I have to do?
>> > > >
>> > > > Thanks and Kind regards
>> > > >
>> > > > On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > Hey Jack, reading the doc:
>> > > > >
>> > > > > "Set to true if phrase queries will be automatically generated when
>> > > > > the analyzer returns more than one term from whitespace delimited
>> > > > > text. NOTE: this behavior may not be suitable for all languages.
>> > > > >
>> > > > > Set to false if phrase queries should only be generated when
>> > > > > surrounded by double quotes."
>> > > > >
>> > > > >
>> > > > > In this user's case, I guess he's likely to use double quotes.
>> > > > >
>> > > > > The only problem he sees so far is that the phrase query uses the
>> > > > > query-time analyser to actually split the tokens.
>> > > > >
>> > > > > First we need feedback from him, but I guess he would like the
>> > > > > phrase query not to tokenise the text within the double quotes.
>> > > > >
>> > > > > In that case we should find a way.
>> > > > >
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > 2015-07-21 13:12 GMT+01:00 Jack Krupansky <[email protected]>:
>> > > > >
>> > > > > > If you don't explicitly enable automatic phrase queries, the
>> > > > > > Lucene query parser will assume an OR operator on the sub-terms
>> > > > > > when a white space-delimited term analyzes into a sequence of
>> > > > > > terms.
>> > > > > >
>> > > > > > See:
>> > > > > >
>> > > > > > https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
>> > > > > >
>> > > > > >
>> > > > > > -- Jack Krupansky
>> > > > > >
>> > > > > > On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti <[email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi all,
>> > > > > > >
>> > > > > > > I'm new to Lucene and tried to write my own analyzer to support
>> > > > > > > hyphenated words like wi-fi, jean-pierre, etc.
>> > > > > > > For our customer it is important to find the word
>> > > > > > > - wi-fi by wi, fi, wifi, wi-fi
>> > > > > > > - jean-pierre by jean, pierre, jean-pierre, jean-*
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > The analyzer:
>> > > > > > > public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>> > > > > > >
>> > > > > > >   protected NormalizeCharMap charConvertMap;
>> > > > > > >
>> > > > > > >   // The constructor name must match the class name.
>> > > > > > >   public SupportHyphenatedWordsAnalyzer() {
>> > > > > > >     initCharConvertMap();
>> > > > > > >   }
>> > > > > > >
>> > > > > > >   // Strip double quotes from the text before tokenization.
>> > > > > > >   protected void initCharConvertMap() {
>> > > > > > >     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>> > > > > > >     builder.add("\"", "");
>> > > > > > >     charConvertMap = builder.build();
>> > > > > > >   }
>> > > > > > >
>> > > > > > >   @Override
>> > > > > > >   protected TokenStreamComponents createComponents(final String fieldName) {
>> > > > > > >
>> > > > > > >     final Tokenizer src = new WhitespaceTokenizer();
>> > > > > > >
>> > > > > > >     // Keep the original word and also emit its parts.
>> > > > > > >     TokenStream tok = new WordDelimiterFilter(src,
>> > > > > > >         WordDelimiterFilter.PRESERVE_ORIGINAL
>> > > > > > >             | WordDelimiterFilter.GENERATE_WORD_PARTS
>> > > > > > >             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>> > > > > > >             | WordDelimiterFilter.CATENATE_WORDS,
>> > > > > > >         null);
>> > > > > > >     tok = new LowerCaseFilter(tok);
>> > > > > > >     tok = new LengthFilter(tok, 1, 255);
>> > > > > > >     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>> > > > > > >
>> > > > > > >     return new TokenStreamComponents(src, tok);
>> > > > > > >   }
>> > > > > > >
>> > > > > > >   @Override
>> > > > > > >   protected Reader initReader(String fieldName, Reader reader) {
>> > > > > > >     return new MappingCharFilter(charConvertMap, reader);
>> > > > > > >   }
>> > > > > > > }
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > The analyzer seems to work except for exact phrase match queries.
>> > > > > > >
>> > > > > > > e.g. the following words are indexed
>> > > > > > >
>> > > > > > > FD-A320-REC-SIM-1
>> > > > > > > FD-A320-REC-SIM-10
>> > > > > > > FD-A320-REC-SIM-11
>> > > > > > > MIA-FD-A320-REC-SIM-1
>> > > > > > > SIN-FD-A320-REC-SIM-1
>> > > > > > >
>> > > > > > >
>> > > > > > > The (exact) query "FD-A320-REC-SIM-1" returns
>> > > > > > > FD-A320-REC-SIM-1
>> > > > > > > MIA-FD-A320-REC-SIM-1
>> > > > > > > SIN-FD-A320-REC-SIM-1
>> > > > > > >
>> > > > > > > For our customer this is wrong, because this exact phrase match
>> > > > > > > query should only return the single entry FD-A320-REC-SIM-1.
>> > > > > > >
>> > > > > > > Do you have any ideas or tips on how we have to change our current
>> > > > > > > analyzer to support this requirement?
>> > > > > > >
>> > > > > > >
>> > > > > > > Thanks and Kind regards
>> > > > > > > Diego
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > --------------------------
>> > > > >
>> > > > > Benedetti Alessandro
>> > > > > Visiting card - http://about.me/alessandro_benedetti
>> > > > > Blog - http://alexbenedetti.blogspot.co.uk
>> > > > >
>> > > > > "Tyger, tyger burning bright
>> > > > > In the forests of the night,
>> > > > > What immortal hand or eye
>> > > > > Could frame thy fearful symmetry?"
>> > > > >
>> > > > > William Blake - Songs of Experience -1794 England
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > >
>> >
>>
>>
>>
>>
>
>
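
For reference, a rough model of why the quoted query in the original question
returns three entries. This is a deliberate simplification, not Lucene code:
it only splits on hyphens and lowercases, and ignores PRESERVE_ORIGINAL,
CATENATE_WORDS and token positions. The class and method names are
hypothetical. Under that assumption a phrase matches whenever its parts appear
as a contiguous run inside the parts of an indexed word, while comparing the
whole lowercased word (the TermQuery idea) matches a single entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseOvermatchDemo {

    // Simplified stand-in for the analyzer: lowercase, then split on '-'.
    static List<String> parts(String word) {
        List<String> out = new ArrayList<>();
        for (String p : word.toLowerCase(Locale.ROOT).split("-")) {
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
        return out;
    }

    // Phrase-style match: the query parts occur as a contiguous run.
    static boolean phraseMatches(String indexed, String phrase) {
        return Collections.indexOfSubList(parts(indexed), parts(phrase)) >= 0;
    }

    // Term-style match: the whole lowercased word must be identical.
    static boolean termMatches(String indexed, String phrase) {
        return indexed.toLowerCase(Locale.ROOT)
                .equals(phrase.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
                "FD-A320-REC-SIM-1",
                "FD-A320-REC-SIM-10",
                "FD-A320-REC-SIM-11",
                "MIA-FD-A320-REC-SIM-1",
                "SIN-FD-A320-REC-SIM-1");
        String query = "FD-A320-REC-SIM-1";
        for (String word : indexed) {
            System.out.println(word
                    + " phrase=" + phraseMatches(word, query)
                    + " term=" + termMatches(word, query));
        }
    }
}
```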