Re: Analyzer for supporting hyphenated words
Hi Alessandro,

after talking to our customer: yes, it needs to be a mix of classic and quoted queries in one userCriteria.

Before we look into the details of the QueryParser: I'm currently using org.apache.lucene.queryparser.classic.QueryParser from Lucene 5.2.1. Is this the right QueryParser to use?

Thanks and kind regards
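For the mixed case (classic and quoted parts in one userCriteria), a first step could be to split the input into quoted and unquoted segments before any query is built. The following is only a minimal sketch in plain Java, not Lucene code: the class name CriteriaSplitter and the Segment type are hypothetical, and it assumes that quotes are balanced and never escaped. Each segment could then be routed either to the analyzer-based path or to an untokenized exact match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits mixed user criteria into quoted and unquoted segments. */
class CriteriaSplitter {

    /** One piece of the user input; 'exact' is true when it was surrounded by double quotes. */
    static final class Segment {
        final String text;
        final boolean exact;

        Segment(String text, boolean exact) {
            this.text = text;
            this.exact = exact;
        }

        @Override
        public String toString() {
            return (exact ? "EXACT:" : "CLASSIC:") + text;
        }
    }

    // Either a double-quoted run (group 1) or a run of non-whitespace characters (group 2).
    private static final Pattern SEGMENT = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<Segment> split(String userCriteria) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = SEGMENT.matcher(userCriteria);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted content: keep it as one untouched string, skip empty quotes.
                if (!m.group(1).trim().isEmpty()) {
                    segments.add(new Segment(m.group(1), true));
                }
            } else {
                segments.add(new Segment(m.group(2), false));
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // prints [CLASSIC:wi-fi, EXACT:FD-A320-REC-SIM-1, CLASSIC:jean-*]
        System.out.println(split("wi-fi \"FD-A320-REC-SIM-1\" jean-*"));
    }
}
```

An unbalanced quote simply ends up inside a classic segment here; real input handling would probably need to be stricter.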
Re: Analyzer for supporting hyphenated words
Hi Alessandro,

I guess code says more than words :)

    ...
    public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
    public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
    ...
    if (isExactCriteriaString(userCriteria)) {
        // strip the surrounding quotes, escape the content, then re-quote it
        String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
                escape(userCriteria.substring(1, userCriteria.length() - 1)));
        userCriteriaProcessed = userCriteriaEscaped;
    } else {
        // classic search: escape and make it a prefix query
        userCriteriaProcessed = escape(userCriteria);
        if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
            userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
        }
    }
    ...
    public static String escape(String s) {
        String result = s;
        if (s != null && !s.trim().isEmpty()) {
            String toEscape = s.trim();
            if (toEscape.contains("*")) {
                // escape every character except the wildcard itself
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < toEscape.length(); i++) {
                    char curChar = toEscape.charAt(i);
                    if (curChar == '*') {
                        sb.append('*');
                    } else {
                        sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
                    }
                }
                result = sb.toString();
            } else {
                result = QueryParser.escape(toEscape);
            }
        }
        return result;
    }
    ...

Thanks and kind regards
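To see what the wildcard-preserving escape produces without a Lucene dependency, here is a stand-alone sketch. It is not the Lucene implementation: the class name WildcardEscaper is hypothetical, and the SPECIAL character list is an assumption meant to mirror the characters the classic query syntax treats as special, with '*' deliberately left out.

```java
/** Stand-alone sketch of a wildcard-preserving escape; not the Lucene implementation. */
class WildcardEscaper {

    // Assumed set of query-syntax characters; '*' is left out on purpose so it survives.
    private static final String SPECIAL = "\\+-!():^[]\"{}~?|&/";

    static String escape(String s) {
        if (s == null || s.trim().isEmpty()) {
            return s; // nothing to escape, same as the original helper
        }
        String toEscape = s.trim();
        StringBuilder sb = new StringBuilder(toEscape.length() * 2);
        for (int i = 0; i < toEscape.length(); i++) {
            char c = toEscape.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix special characters with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("jean-*"));      // prints jean\-*
        System.out.println(escape("FD-A320(1)"));  // prints FD\-A320\(1\)
    }
}
```

Doing the check per character in one pass avoids the two separate branches of the original helper, but the behaviour for input without a wildcard should be the same under the stated assumption.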
Re: Analyzer for supporting hyphenated words
Sorry, a small typo from refactoring the code: curTokenProcessed should be userCriteriaProcessed.

    ...
    public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
    public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
    ...
    if (isExactCriteriaString(userCriteria)) {
        String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
                escape(userCriteria.substring(1, userCriteria.length() - 1)));
        userCriteriaProcessed = userCriteriaEscaped;
    } else {
        userCriteriaProcessed = escape(userCriteria);
        if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
            userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
        }
    }

    // one "field:criteria" clause per searched field
    String queryStr = "";
    for (String fieldName : fields) {
        String escapedFieldName = escape(fieldName);
        queryStr += String.format("%s:%s ", escapedFieldName, userCriteriaProcessed);
    }
    query = new QueryParser("", analyzer).parse(queryStr.trim());
    ...
Re: Analyzer for supporting hyphenated words
Yes, what I meant is that you can still use your analyzer when the query is not in quotes. When it is in quotes, you can build a TermQuery directly out of it.

Of course the scenario is not that simple. Do you think the quoted and the unquoted query parts are two different sets of queries whose intersection is always empty, i.e. a user asks EITHER for a quoted query OR for a classic query? In that scenario it will be simple.

In the case of a mix, we should take a closer look at the Lucene query parser code and see how the tokenization of content within quotes is handled.

Cheers
Re: Analyzer for supporting hyphenated words
I read it briefly; correct me if I am wrong, but that code only parses the content within the quotes. We are still at the String level. I want to see how you build the PhraseQuery :)

Looking at the code, PhraseQuery allows you to add as many terms as you want. What you need to do is to not tokenize the content within the quotes and actually create a TermQuery (in your case you are not even using the position feature offered by the phrase query; you simply want to run a TermQuery).

So, to clarify: you should parse the content within the quotes (as you are doing), then build a TermQuery out of that String, not tokenized at all.

Does this make sense to you? Can I see what you do after identifying the content within the quotes?

Cheers
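Why an untokenized term lookup would return only the exact entry can be illustrated with a toy inverted index. This is a sketch in plain Java, not Lucene: the class ExactTermLookup is hypothetical, and it assumes that the index keeps the lowercased original token next to its hyphen-separated parts. Note that the quoted content still has to be lowercased by hand, because no analyzer runs on it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index illustrating an untokenized, TermQuery-style lookup; not Lucene code. */
class ExactTermLookup {

    // term -> entries that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes the lowercased original token plus its hyphen-separated parts. */
    void index(String entry) {
        String original = entry.toLowerCase(Locale.ROOT);
        add(original, entry);
        for (String part : original.split("-")) {
            add(part, entry);
        }
    }

    private void add(String term, String entry) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(entry);
    }

    /** Looks up the quoted content as one single term: lowercased, but never split. */
    Set<String> exact(String quotedContent) {
        return postings.getOrDefault(quotedContent.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        ExactTermLookup idx = new ExactTermLookup();
        for (String e : new String[] {"FD-A320-REC-SIM-1", "FD-A320-REC-SIM-10",
                "MIA-FD-A320-REC-SIM-1", "SIN-FD-A320-REC-SIM-1"}) {
            idx.index(e);
        }
        System.out.println(idx.exact("FD-A320-REC-SIM-1")); // prints [FD-A320-REC-SIM-1]
    }
}
```

The entries starting with MIA- and SIN- are not returned because their original tokens are different terms, even though they share every sub-term with the query.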
Re: Analyzer for supporting hyphenated words
Hi Alessandro,

sorry that I forgot the important part. Here it is:

    ...
    public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
    public static final String MULTIPLE_CHARACTER_WILDCARD = "*";
    ...
    if (isExactCriteriaString(userCriteria)) {
        String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
                escape(userCriteria.substring(1, userCriteria.length() - 1)));
        userCriteriaProcessed = userCriteriaEscaped;
    } else {
        userCriteriaProcessed = escape(userCriteria);
        if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
            userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
        }
    }

    String queryStr = "";
    for (String fieldName : fields) {
        String escapedFieldName = escape(fieldName);
        queryStr += String.format("%s:%s ", escapedFieldName, curTokenProcessed);
    }
    query = new QueryParser("", analyzer).parse(queryStr.trim());
    ...

As far as I understand, my problem is that in my naive, query-syntax-based solution I have to use my analyzer, which means that the userCriteria is always tokenized.

You suggest using the Java query classes to build the query, because then I can control whether the userCriteria is tokenized or not. Did I get you right?

Thanks and kind regards
Re: Analyzer for supporting hyphenated words
Hi Alessandro,

yes, I want the user to be able to surround the query with "" to run the phrase query with a NOT tokenized phrase. What do I have to do?

Thanks and kind regards

On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote:

Hey Jack, reading the doc:

    Set to true if phrase queries will be automatically generated when the analyzer returns more than one term from whitespace delimited text. NOTE: this behavior may not be suitable for all languages. Set to false if phrase queries should only be generated when surrounded by double quotes.

In the user's case, I guess he is likely to use double quotes. The only problem he sees so far is that the phrase query uses the query-time analyzer to split the tokens. First we need feedback from him, but I guess he would like the phrase query not to tokenize the text within the double quotes. In that case we should find a way.

Cheers

2015-07-21 13:12 GMT+01:00 Jack Krupansky jack.krupan...@gmail.com:

If you don't explicitly enable automatic phrase queries, the Lucene query parser will assume an OR operator on the sub-terms when a whitespace-delimited term analyzes into a sequence of terms. See:

https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)

-- Jack Krupansky

On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti socac...@gmail.com wrote:

Hi all,

I'm new to Lucene and tried to write my own analyzer to support hyphenated words like wi-fi, jean-pierre, etc. For our customer it is important to find the word

- wi-fi by wi, fi, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*

The analyzer:

    public class SupportHyphenatedWordsAnalyzer extends Analyzer {

        protected NormalizeCharMap charConvertMap;

        public SupportHyphenatedWordsAnalyzer() {
            initCharConvertMap();
        }

        protected void initCharConvertMap() {
            // strip double quotes from the input before tokenizing
            NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
            builder.add("\"", "");
            charConvertMap = builder.build();
        }

        @Override
        protected TokenStreamComponents createComponents(final String fieldName) {
            final Tokenizer src = new WhitespaceTokenizer();
            TokenStream tok = new WordDelimiterFilter(src,
                    WordDelimiterFilter.PRESERVE_ORIGINAL
                            | WordDelimiterFilter.GENERATE_WORD_PARTS
                            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                            | WordDelimiterFilter.CATENATE_WORDS,
                    null);
            tok = new LowerCaseFilter(tok);
            tok = new LengthFilter(tok, 1, 255);
            tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            return new TokenStreamComponents(src, tok);
        }

        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new MappingCharFilter(charConvertMap, reader);
        }
    }

The analyzer seems to work except for exact phrase match queries. E.g. the following words are indexed:

    FD-A320-REC-SIM-1
    FD-A320-REC-SIM-10
    FD-A320-REC-SIM-11
    MIA-FD-A320-REC-SIM-1
    SIN-FD-A320-REC-SIM-1

The (exact) query "FD-A320-REC-SIM-1" returns

    FD-A320-REC-SIM-1
    MIA-FD-A320-REC-SIM-1
    SIN-FD-A320-REC-SIM-1

For our customer this is wrong, because this exact phrase match query should only return the single entry FD-A320-REC-SIM-1.

Do you have any ideas or tips on how we have to change our current analyzer to support this requirement?

Thanks and kind regards
Diego

-- 
Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
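The requirement "find wi-fi by wi, fi, wifi, wi-fi" can be pictured as a set of index-time tokens per word. The sketch below is a simplified model in plain Java, not WordDelimiterFilter itself: the class name HyphenTokens is hypothetical, it only splits on hyphens, and it ignores positions and the separate handling of number parts that the real filter applies.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Simplified model of the index-time tokens for one hyphenated word; not WordDelimiterFilter. */
class HyphenTokens {

    static Set<String> tokensFor(String word) {
        String original = word.toLowerCase(Locale.ROOT);
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(original);                  // the original token, kept as-is
        StringBuilder joined = new StringBuilder();
        for (String part : original.split("-")) {
            if (!part.isEmpty()) {
                tokens.add(part);              // each hyphen-separated part
                joined.append(part);
            }
        }
        tokens.add(joined.toString());         // all parts catenated
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokensFor("Wi-Fi"));        // prints [wi-fi, wi, fi, wifi]
        System.out.println(tokensFor("jean-pierre"));  // prints [jean-pierre, jean, pierre, jeanpierre]
    }
}
```

With tokens like these, every listed search form for wi-fi hits some indexed term, which is also why a tokenized query for one entry can reach other entries that share the same parts.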
Re: Analyzer for supporting hyphenated words
As a start Diego, how do you currently parse the user query to build the Lucene queries ? Cheers 2015-07-22 8:35 GMT+01:00 Diego Socaceti socac...@gmail.com: Hi Alessandro, yes, i want the user to be able to surround the query with to run the phrase query with a NOT tokenized phrase. What do i have to do? Thanks and Kind regards On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Hey Jack, reading the doc : Set to true if phrase queries will be automatically generated when the analyzer returns more than one term from whitespace delimited text. NOTE: this behavior may not be suitable for all languages. Set to false if phrase queries should only be generated when surrounded by double quotes. In the user case , i guess he's likely to use double quotes. The only problem he sees so far is that the phrase query uses the query time analyser to actually split the tokens. First we need a feedback from him, but I guess he would like to have the phrase query, to not tokenise the text within the double quotes. In the case we should find a way. Cheers 2015-07-21 13:12 GMT+01:00 Jack Krupansky jack.krupan...@gmail.com: If you don't explicitly enable automatic phrase queries, the Lucene query parser will assume an OR operator on the sub-terms when a white space-delimited term analyzes into a sequence of terms. See: https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean) -- Jack Krupansky On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti socac...@gmail.com wrote: Hi all, i'm new to lucene and tried to write my own analyzer to support hyphenated words like wi-fi, jean-pierre, etc. 
For our customer it is important to find the word - wi-fi by wi, fi, wifi, wi-fi - jean-pierre by jean, pierre, jean-pierre, jean-* The analyzer: public class SupportHyphenatedWordsAnalyzer extends Analyzer { protected NormalizeCharMap charConvertMap; public MinLuceneAnalyzer() { initCharConvertMap(); } protected void initCharConvertMap() { NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder(); builder.add(\, ); charConvertMap = builder.build(); } @Override protected TokenStreamComponents createComponents(final String fieldName) { final Tokenizer src = new WhitespaceTokenizer(); TokenStream tok = new WordDelimiterFilter(src, WordDelimiterFilter.PRESERVE_ORIGINAL | WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.CATENATE_WORDS, null); tok = new LowerCaseFilter(tok); tok = new LengthFilter(tok, 1, 255); tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET); return new TokenStreamComponents(src, tok); } @Override protected Reader initReader(String fieldName, Reader reader) { return new MappingCharFilter(charConvertMap, reader); } } The analyzer seems to work except for exact phrase match queries. e.g. the following words are indexed FD-A320-REC-SIM-1 FD-A320-REC-SIM-10 FD-A320-REC-SIM-11 MIA-FD-A320-REC-SIM-1 SIN-FD-A320-REC-SIM-1 The (exact) query FD-A320-REC-SIM-1 returns FD-A320-REC-SIM-1 MIA-FD-A320-REC-SIM-1 SIN-FD-A320-REC-SIM-1 for our customer this is wrong because this exact phrase match query should only return the single entry FD-A320-REC-SIM-1 Do you have any ideas or tips, how we have to change our current analyzer to support this requirement??? Thanks and Kind regards Diego -- -- Benedetti Alessandro Visiting card - http://about.me/alessandro_benedetti Blog - http://alexbenedetti.blogspot.co.uk Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? 
William Blake - Songs of Experience -1794 England
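The question of how the user query is parsed comes down to separating the quoted parts of userCriteria from the unquoted ones, so that only the unquoted parts go through the analyzer. A minimal sketch of that split in plain Java follows. The class CriteriaSplitter and its Segment type are hypothetical names invented for this illustration; they are not part of Lucene or of the code posted in this thread, and the sketch does not handle escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;

public class CriteriaSplitter {

    /** One piece of the user input; quoted pieces are meant to bypass the analyzer. */
    public static final class Segment {
        public final String text;
        public final boolean quoted;

        Segment(String text, boolean quoted) {
            this.text = text;
            this.quoted = quoted;
        }

        @Override
        public String toString() {
            return (quoted ? "EXACT:" : "ANALYZED:") + text;
        }
    }

    /** Splits the input into quoted segments and whitespace-separated unquoted segments. */
    public static List<Segment> split(String userCriteria) {
        List<Segment> segments = new ArrayList<>();
        int pos = 0;
        while (pos < userCriteria.length()) {
            int open = userCriteria.indexOf('"', pos);
            int close = open < 0 ? -1 : userCriteria.indexOf('"', open + 1);
            if (close < 0) {
                // No complete quoted part is left: treat the rest as ordinary text.
                addUnquoted(segments, userCriteria.substring(pos));
                break;
            }
            addUnquoted(segments, userCriteria.substring(pos, open));
            String inner = userCriteria.substring(open + 1, close);
            if (!inner.isEmpty()) {
                segments.add(new Segment(inner, true));
            }
            pos = close + 1;
        }
        return segments;
    }

    private static void addUnquoted(List<Segment> segments, String text) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                segments.add(new Segment(word, false));
            }
        }
    }

    public static void main(String[] args) {
        // A mix of classic and quoted criteria in one input string.
        for (Segment s : split("wi-fi \"FD-A320-REC-SIM-1\" jean-*")) {
            System.out.println(s);
        }
    }
}
```

Each EXACT segment could then be turned into an untokenized query and each ANALYZED segment handed to the query parser as before; how the two kinds are combined is left open here.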
Re: Analyzer for supporting hyphenated words
If you don't explicitly enable automatic phrase queries, the Lucene query parser will assume an OR operator on the sub-terms when a white space-delimited term analyzes into a sequence of terms. See:
https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)

-- Jack Krupansky
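To make the difference between OR-ed sub-terms and a phrase concrete, here is a toy model in plain Java. It is only a sketch: analyze() merely lowercases and splits on hyphens, which is a deliberate simplification of the posted filter chain, and the matching methods imitate the semantics of the two query types rather than call Lucene. All names in it are invented for the illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class OrVersusPhraseDemo {

    /** Toy analysis: lowercase and split on hyphens (a stand-in for the real filter chain). */
    public static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("-"));
    }

    /** OR semantics: a document matches if it shares at least one term with the query. */
    public static boolean matchesAnyTerm(String doc, String query) {
        return !Collections.disjoint(analyze(doc), analyze(query));
    }

    /** Phrase semantics: the query terms must occur in the document at consecutive positions. */
    public static boolean matchesPhrase(String doc, String query) {
        return Collections.indexOfSubList(analyze(doc), analyze(query)) >= 0;
    }

    public static List<String> search(List<String> docs, String query, boolean phrase) {
        return docs.stream()
                .filter(d -> phrase ? matchesPhrase(d, query) : matchesAnyTerm(d, query))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "FD-A320-REC-SIM-1", "FD-A320-REC-SIM-10", "FD-A320-REC-SIM-11",
                "MIA-FD-A320-REC-SIM-1", "SIN-FD-A320-REC-SIM-1");
        System.out.println("OR:     " + search(docs, "FD-A320-REC-SIM-1", false));
        System.out.println("phrase: " + search(docs, "FD-A320-REC-SIM-1", true));
    }
}
```

In this model the OR variant matches all five entries, while the phrase variant matches the same three entries reported in the original post: the query terms appear consecutively inside MIA-FD-A320-REC-SIM-1 and SIN-FD-A320-REC-SIM-1, but not inside the entries ending in 10 or 11.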
Re: Analyzer for supporting hyphenated words
Hi Diego, let me try to help. I find this a little bit confusing:

For our customer it is important to find the word
- *wi-fi* by wi, *fi*, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*

But:

The (exact) query *FD-A320-REC-SIM-1* returns
FD-A320-REC-SIM-1
MIA-*FD-A320-REC-SIM-1*
SIN-FD-A320-REC-SIM-1
for our customer this is wrong because this exact phrase match query should only return the single entry FD-A320-REC-SIM-1

As you may notice, the suffix fi in the first example is comparable to the suffix FD-A320-REC-SIM-1 in the second.

To qualify your requirement: do you want the user to be able to surround the query with double quotes to run the phrase query with a NOT tokenized phrase? By default a phrase query is tokenized like the others, but term positions affect the matching!

If I have identified your requirement correctly, we can think about a solution!

Cheers
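The contrast between the two requirements can be sketched in plain Java: an unquoted criterion such as fi should match by word part, while a quoted criterion should be compared against the whole, untokenized value. The class and method names below are hypothetical, and the lowercasing is an assumption carried over from the posted analyzer. In Lucene itself this would presumably correspond to indexing the value a second time in an untokenized field (for example StringField, which stores the value verbatim as a single term) and querying it with a TermQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExactTermDemo {

    /** Untokenized comparison: the whole value is one term (lowercased only). */
    public static List<String> exactMatches(List<String> docs, String quotedCriteria) {
        String term = quotedCriteria.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (doc.toLowerCase(Locale.ROOT).equals(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Tokenized comparison: the criterion matches any hyphen-separated part of the value. */
    public static List<String> partMatches(List<String> docs, String criteria) {
        String term = criteria.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (Arrays.asList(doc.toLowerCase(Locale.ROOT).split("-")).contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "wi-fi", "FD-A320-REC-SIM-1", "FD-A320-REC-SIM-10",
                "MIA-FD-A320-REC-SIM-1", "SIN-FD-A320-REC-SIM-1");
        System.out.println(partMatches(docs, "fi"));
        System.out.println(exactMatches(docs, "FD-A320-REC-SIM-1"));
    }
}
```

With the untokenized comparison, the quoted criterion returns only the single entry FD-A320-REC-SIM-1, which is the behaviour the customer asked for, while fi still finds wi-fi through the tokenized path.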
Re: Analyzer for supporting hyphenated words
Hey Jack, reading the doc:

Set to true if phrase queries will be automatically generated when the analyzer returns more than one term from whitespace delimited text. NOTE: this behavior may not be suitable for all languages. Set to false if phrase queries should only be generated when surrounded by double quotes.

In this use case, I guess he's likely to use double quotes. The only problem he sees so far is that the phrase query uses the query-time analyser to split the tokens. First we need feedback from him, but I guess he would like the phrase query not to tokenise the text within the double quotes. In that case we should find a way.

Cheers