
Re: Analyzer for supporting hyphenated words

2015-07-24 Thread Diego Socaceti

Hi Alessandro,

after talking to our customer:
Yes, it needs to be a mix of classic and quoted queries in one userCriteria.

Before we look into the details of the QueryParser: I'm currently using
org.apache.lucene.queryparser.classic.QueryParser from Lucene 5.2.1.
Is this the right QueryParser to use?


Thanks and Kind regards
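
The requirement stated above (classic and quoted parts mixed in one userCriteria) implies that the input first has to be split into quoted and unquoted pieces before each piece is turned into a query. A minimal sketch of that splitting step, using only the JDK, could look like the following. The class name CriteriaSplitter and its Segment type are hypothetical and do not come from the thread or from Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits user input into quoted and unquoted segments. */
final class CriteriaSplitter {

  /** One piece of the input; 'exact' is true for text that was inside double quotes. */
  static final class Segment {
    final String text;
    final boolean exact;

    Segment(String text, boolean exact) {
      this.text = text;
      this.exact = exact;
    }
  }

  static List<Segment> split(String userCriteria) {
    List<Segment> segments = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;

    for (int i = 0; i < userCriteria.length(); i++) {
      char c = userCriteria.charAt(i);
      if (c == '"') {
        // A quote closes whatever was collected so far and flips the mode.
        flush(segments, current, inQuotes);
        inQuotes = !inQuotes;
      } else if (!inQuotes && Character.isWhitespace(c)) {
        // Outside quotes, whitespace separates classic terms.
        flush(segments, current, false);
      } else {
        current.append(c);
      }
    }
    // An unbalanced opening quote is treated as quoted up to the end of the input.
    flush(segments, current, inQuotes);
    return segments;
  }

  private static void flush(List<Segment> out, StringBuilder buf, boolean exact) {
    if (buf.length() > 0) {
      out.add(new Segment(buf.toString(), exact));
      buf.setLength(0);
    }
  }
}
```

For the input wi-fi "FD-A320-REC-SIM-1" jean-* this yields three segments, of which only the middle one is marked exact. The exact segments could then be handled without the analyzer and the others passed through the classic QueryParser as before.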



Re: Analyzer for supporting hyphenated words

2015-07-22 Thread Diego Socaceti

Hi Alessandro,

I guess code says more than words :)

...

public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
public static final String MULTIPLE_CHARACTER_WILDCARD = "*";

...

  if (isExactCriteriaString(userCriteria)) {
    String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
        escape(userCriteria.substring(1, userCriteria.length() - 1)));
    userCriteriaProcessed = userCriteriaEscaped;
  } else {
    userCriteriaProcessed = escape(userCriteria);

    if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
      userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
    }
  }

...

public static String escape(String s) {
  String result = s;

  if (s != null && !s.trim().isEmpty()) {
    String toEscape = s.trim();

    if (toEscape.contains("*")) {
      StringBuilder sb = new StringBuilder();

      for (int i = 0; i < toEscape.length(); i++) {
        char curChar = toEscape.charAt(i);
        if (curChar == '*')
          sb.append('*');
        else
          sb.append(QueryParser.escape(toEscape.substring(i, i + 1)));
      }

      result = sb.toString();
    } else {
      result = QueryParser.escape(toEscape);
    }
  }

  return result;
}

...

Thanks and Kind regards
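
To make the effect of the escape method above easier to see, here is a JDK-only sketch that escapes query-syntax characters in a single pass while leaving '*' untouched. The class name WildcardFriendlyEscaper is hypothetical. The list of special characters is an assumption: it is meant to mirror what the classic QueryParser.escape() treats as special, minus '*', and should be checked against the Lucene version in use.

```java
/** Sketch: escapes query-syntax characters but keeps '*' usable as a wildcard. */
final class WildcardFriendlyEscaper {

  // Assumption: the characters escaped by Lucene's classic QueryParser.escape(),
  // with '*' deliberately left out so that wildcards survive.
  private static final String SPECIAL = "\\+-!():^[]\"{}~?|&/";

  static String escape(String s) {
    // Like the posted method, null or blank input is returned unchanged.
    if (s == null || s.trim().isEmpty()) {
      return s;
    }
    String toEscape = s.trim();
    StringBuilder sb = new StringBuilder(toEscape.length() * 2);
    for (int i = 0; i < toEscape.length(); i++) {
      char c = toEscape.charAt(i);
      if (SPECIAL.indexOf(c) >= 0) {
        sb.append('\\'); // prefix each special character with a backslash
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
```

With this sketch, jean-* becomes jean\-* and FD-A320-REC-SIM-1 becomes FD\-A320\-REC\-SIM\-1, so the hyphens are no longer read as operators while the trailing wildcard stays active.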




Re: Analyzer for supporting hyphenated words

2015-07-22 Thread Diego Socaceti

Sorry, a small typo from refactoring the code: curTokenProcessed should be
userCriteriaProcessed.

...

public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
public static final String MULTIPLE_CHARACTER_WILDCARD = "*";

...

  if (isExactCriteriaString(userCriteria)) {
    String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
        escape(userCriteria.substring(1, userCriteria.length() - 1)));
    userCriteriaProcessed = userCriteriaEscaped;
  } else {
    userCriteriaProcessed = escape(userCriteria);

    if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
      userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
    }
  }


  String queryStr = "";

  for (String fieldName : fields) {
    String escapedFieldName = escape(fieldName);
    queryStr += String.format("%s:%s ", escapedFieldName,
        userCriteriaProcessed);
  }

  query = new QueryParser("", analyzer).parse(queryStr.trim());

...
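
The loop above produces one field:criteria clause per field, separated by spaces, and hands the result to the QueryParser. A small JDK-only sketch of the same string building is shown below, mainly to make the resulting query string visible. The class name FieldQueryStrings and the field names used in the example are hypothetical, and the sketch assumes the field names are plain identifiers that need no escaping.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch: builds "field:criteria" clauses for several fields, separated by spaces. */
final class FieldQueryStrings {

  static String build(List<String> fields, String processedCriteria) {
    // Joining avoids the trailing space that the += loop has to trim away.
    return fields.stream()
        .map(field -> field + ":" + processedCriteria)
        .collect(Collectors.joining(" "));
  }
}
```

For the fields title and name and the processed criteria "FD\-A320\-REC\-SIM\-1" (including the surrounding quotes), this gives title:"FD\-A320\-REC\-SIM\-1" name:"FD\-A320\-REC\-SIM\-1". Everything inside the quotes is still passed through the analyzer by the parser, which is the point discussed in the rest of the thread.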


Re: Analyzer for supporting hyphenated words

2015-07-22 Thread Alessandro Benedetti

Yes, what I meant is that you can still use your analyser when the query
is not in quotes.
When it is in quotes, you can build a TermQuery directly from it.
Of course the scenario is not quite that simple. Do you think quoted and
unquoted queries are two disjoint sets, i.e. a user asks EITHER for a
quoted query OR for a classic query, never both in one input?
In that scenario it will be simple.

In the case of a mix, we should take a closer look at the Lucene query
parser code and see how the tokenization of content within quotes is
handled.

Cheers


Re: Analyzer for supporting hyphenated words

2015-07-22 Thread Alessandro Benedetti

I read it briefly; correct me if I am wrong, but that code only extracts the
content within the quotes ("").
We are still at the String level.
I want to see how you build the PhraseQuery :)
Looking at the code, PhraseQuery allows you to add as many terms as
you want.

What you need to do is to not tokenise the content within the quotes and
create a TermQuery instead (in your case you are not even using the
position feature offered by the phrase query; you simply want to
run a TermQuery).

So, to clarify: you should parse the content within the quotes (as you are
doing), then build a TermQuery out of that String, not tokenized at all.

Does this make sense to you?
Can I see what you do after identifying the content within the quotes?

Cheers
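
One detail worth keeping in mind with the TermQuery suggestion: no analyzer runs on the term text, so any normalisation done at index time has to be repeated by hand. The JDK-only sketch below extracts the quoted content and lower-cases it, on the assumption that the index-time analyzer from the start of the thread (which applies a LowerCaseFilter and preserves the original token) is in use. The class ExactTerm is hypothetical, and isQuoted is only a stand-in for the isExactCriteriaString helper, whose implementation the thread does not show.

```java
import java.util.Locale;

/** Sketch: turns a quoted user input into the text of a single, untokenized term. */
final class ExactTerm {

  /** Hypothetical stand-in for isExactCriteriaString, which is not shown in the thread. */
  static boolean isQuoted(String s) {
    return s != null && s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
  }

  /** Returns the term text for quoted input, or null if the input is not quoted. */
  static String toTermText(String userCriteria) {
    String trimmed = userCriteria == null ? null : userCriteria.trim();
    if (!isQuoted(trimmed)) {
      return null;
    }
    String inner = trimmed.substring(1, trimmed.length() - 1);
    // Assumption: indexed tokens were lower-cased, so the lookup text must be too.
    return inner.toLowerCase(Locale.ROOT);
  }
}
```

The returned text would then be used for something like new TermQuery(new Term(fieldName, termText)). Because the index-time tokenizer splits on whitespace, this can only match when the quoted content is a single whitespace-free token such as FD-A320-REC-SIM-1.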



Re: Analyzer for supporting hyphenated words

2015-07-22 Thread Diego Socaceti

Hi Alessandro,

Sorry, I forgot the important part. Here it is:

...

public static final String EXACT_SEARCH_FORMAT = "\"%s\"";
public static final String MULTIPLE_CHARACTER_WILDCARD = "*";

...

  if (isExactCriteriaString(userCriteria)) {
    String userCriteriaEscaped = String.format(EXACT_SEARCH_FORMAT,
        escape(userCriteria.substring(1, userCriteria.length() - 1)));
    userCriteriaProcessed = userCriteriaEscaped;
  } else {
    userCriteriaProcessed = escape(userCriteria);

    if (!userCriteria.endsWith(MULTIPLE_CHARACTER_WILDCARD)) {
      userCriteriaProcessed += MULTIPLE_CHARACTER_WILDCARD;
    }
  }


  String queryStr = "";

  for (String fieldName : fields) {
    String escapedFieldName = escape(fieldName);
    queryStr += String.format("%s:%s ", escapedFieldName,
        curTokenProcessed);
  }

  query = new QueryParser("", analyzer).parse(queryStr.trim());

...


As far as I understand, my problem is that in my naive, query-syntax-based
solution I have to use my analyzer, which means that the userCriteria is
always tokenized.

You suggest using the Java query classes to build the query, because then
I can control whether the userCriteria is tokenized or not.
Did I get you right?


Thanks and Kind regards


Re: Analyzer for supporting hyphenated words

2015-07-22 Thread Diego Socaceti

Hi Alessandro,

Yes, I want the user to be able to surround the query with "" to run the
phrase query with a NOT tokenized phrase.

What do I have to do?

Thanks and Kind regards

On Tue, Jul 21, 2015 at 2:47 PM, Alessandro Benedetti 
benedetti.ale...@gmail.com wrote:

 Hey Jack, reading the doc:

 "Set to true if phrase queries will be automatically generated when the
 analyzer returns more than one term from whitespace delimited text. NOTE:
 this behavior may not be suitable for all languages.

 Set to false if phrase queries should only be generated when surrounded by
 double quotes."


 In this user's case, I guess he is likely to use double quotes.

 The only problem he sees so far is that the phrase query uses the
 query-time analyser to split the text into tokens.

 First we need feedback from him, but I guess he would like the phrase
 query not to tokenise the text within the double quotes.

 In that case we should find a way.


 Cheers

 2015-07-21 13:12 GMT+01:00 Jack Krupansky jack.krupan...@gmail.com:

  If you don't explicitly enable automatic phrase queries, the Lucene query
  parser will assume an OR operator on the sub-terms when a white
  space-delimited term analyzes into a sequence of terms.
 
  See:
 
 
 https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
 
 
  -- Jack Krupansky
 
  On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti socac...@gmail.com
  wrote:
 
   Hi all,
  
   I'm new to Lucene and tried to write my own analyzer to support
   hyphenated words like wi-fi, jean-pierre, etc.
   For our customer it is important to find the word
   - wi-fi by wi, fi, wifi, wi-fi
   - jean-pierre by jean, pierre, jean-pierre, jean-*
  
  
  
  
   The analyzer:
   public class SupportHyphenatedWordsAnalyzer extends Analyzer {

     protected NormalizeCharMap charConvertMap;

     public SupportHyphenatedWordsAnalyzer() {
       initCharConvertMap();
     }

     protected void initCharConvertMap() {
       NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
       // Remove double quotes from the text before it is tokenized.
       builder.add("\"", "");
       charConvertMap = builder.build();
     }

     @Override
     protected TokenStreamComponents createComponents(final String fieldName) {

       final Tokenizer src = new WhitespaceTokenizer();

       // Keep the original token and also emit its parts and their concatenation.
       TokenStream tok = new WordDelimiterFilter(src,
           WordDelimiterFilter.PRESERVE_ORIGINAL
           | WordDelimiterFilter.GENERATE_WORD_PARTS
           | WordDelimiterFilter.GENERATE_NUMBER_PARTS
           | WordDelimiterFilter.CATENATE_WORDS,
           null);
       tok = new LowerCaseFilter(tok);
       tok = new LengthFilter(tok, 1, 255);
       tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

       return new TokenStreamComponents(src, tok);
     }

     @Override
     protected Reader initReader(String fieldName, Reader reader) {
       return new MappingCharFilter(charConvertMap, reader);
     }
   }
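
As a rough illustration of what the WordDelimiterFilter flags in the analyzer are meant to achieve for a word like wi-fi, the JDK-only sketch below lists the tokens one would expect to end up in the index. It is a deliberate simplification and not the filter's real algorithm: it only splits on hyphens, ignores token positions, and treats number parts like word parts. The class name HyphenTokenSketch is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Simplified model of the tokens produced for a hyphenated, purely alphabetic word. */
final class HyphenTokenSketch {

  static Set<String> tokens(String word) {
    Set<String> out = new LinkedHashSet<>();
    String lower = word.toLowerCase(Locale.ROOT); // models LowerCaseFilter
    out.add(lower);                               // models PRESERVE_ORIGINAL

    StringBuilder joined = new StringBuilder();
    for (String part : lower.split("-")) {
      if (!part.isEmpty()) {
        out.add(part);                            // models GENERATE_WORD_PARTS
        joined.append(part);
      }
    }
    out.add(joined.toString());                   // models CATENATE_WORDS
    return out;
  }
}
```

For Wi-Fi the sketch returns wi-fi, wi, fi and wifi, which is why each of those four inputs can find the word. Mixed values such as FD-A320-REC-SIM-1 are handled differently by the real filter, so the sketch should not be used to predict their tokens.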
  
  
  
  
  
   The analyzer seems to work except for exact phrase match queries.
  
   e.g. the following words are indexed
  
   FD-A320-REC-SIM-1
   FD-A320-REC-SIM-10
   FD-A320-REC-SIM-11
   MIA-FD-A320-REC-SIM-1
   SIN-FD-A320-REC-SIM-1
  
  
   The (exact) query "FD-A320-REC-SIM-1" returns
   FD-A320-REC-SIM-1
   MIA-FD-A320-REC-SIM-1
   SIN-FD-A320-REC-SIM-1

   For our customer this is wrong, because this exact phrase match
   query should only return the single entry FD-A320-REC-SIM-1.

   Do you have any ideas or tips on how we have to change our current
   analyzer to support this requirement?
  
  
   Thanks and Kind regards
   Diego
  
 



 --
 --

 Benedetti Alessandro
 Visiting card - http://about.me/alessandro_benedetti
 Blog - http://alexbenedetti.blogspot.co.uk

 Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?

 William Blake - Songs of Experience -1794 England

Re: Analyzer for supporting hyphenated words

2015-07-22 Thread Alessandro Benedetti

To start with, Diego: how do you currently parse the user query to build the
Lucene queries?

Cheers

2015-07-22 8:35 GMT+01:00 Diego Socaceti socac...@gmail.com:

 Hi Alessandro,

 Yes, I want the user to be able to surround the query with double quotes
 to run the phrase query against a NOT tokenized phrase.

 What do I have to do?

 Thanks and Kind regards

 





Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Jack Krupansky

If you don't explicitly enable automatic phrase queries, the Lucene query
parser will assume an OR operator on the sub-terms when a
whitespace-delimited term analyzes into a sequence of terms.

See:
https://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean)
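
To make the OR behaviour described above concrete, here is a minimal, Lucene-free sketch. It is only a model: `OrOfSubTermsSketch`, `subTerms` and `matchesAnySubTerm` are hypothetical names, and splitting on `-` is a simplifying assumption that stands in for the real analyzer (which also emits the original and concatenated tokens).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Lucene-free model of OR matching over the sub-terms of a hyphenated term.
final class OrOfSubTermsSketch {

    // Hypothetical stand-in for the analyzer: lower-case, then split on '-'.
    static List<String> subTerms(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("-"));
    }

    // OR semantics: a single shared sub-term is enough for a match.
    static boolean matchesAnySubTerm(String indexed, String query) {
        Set<String> indexedTerms = new HashSet<>(subTerms(indexed));
        for (String term : subTerms(query)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] indexed = {
            "FD-A320-REC-SIM-1", "FD-A320-REC-SIM-10", "FD-A320-REC-SIM-11",
            "MIA-FD-A320-REC-SIM-1", "SIN-FD-A320-REC-SIM-1"
        };
        for (String id : indexed) {
            System.out.println(id + " -> " + matchesAnySubTerm(id, "FD-A320-REC-SIM-1"));
        }
    }
}
```

In this model an unquoted FD-A320-REC-SIM-1 matches all five indexed ids, because each of them shares at least one sub-term such as fd or a320 with the query.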


-- Jack Krupansky


Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Alessandro Benedetti

Hi Diego,
let me try to help.

I find this a little bit confusing:

For our customer it is important to find the word
- *wi-fi* by wi, *fi*, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*

But :

The (exact) query *FD-A320-REC-SIM-1* returns
FD-A320-REC-SIM-1
MIA-*FD-A320-REC-SIM-1*
SIN-FD-A320-REC-SIM-1

for our customer this is wrong because this exact phrase match
query should only return the single entry FD-A320-REC-SIM-1


Notice that the suffix "fi" in the first example is comparable to the
suffix "FD-A320-REC-SIM-1" in the second.
To clarify your requirement:

Do you want the user to be able to surround the query with double quotes to
run the phrase query against a NOT tokenized phrase?
By default, a phrase query is tokenized like any other query, but the term
positions affect the matching!
If I have understood your requirement correctly, we can think about a
solution!
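
The point about term positions can be illustrated with a small, Lucene-free sketch. It assumes a much simplified tokenization (lower-casing and splitting on `-` only); `PhrasePositionSketch`, `tokenize` and `phraseStart` are hypothetical names, not Lucene API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Lucene-free model of a phrase query: tokens must occur at consecutive positions.
final class PhrasePositionSketch {

    // Hypothetical stand-in for the analyzer: lower-case, then split on '-'.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("-"));
    }

    // Returns the position where the phrase tokens start in the indexed
    // token list, or -1 if they never occur consecutively.
    static int phraseStart(String indexed, String phrase) {
        return Collections.indexOfSubList(tokenize(indexed), tokenize(phrase));
    }

    public static void main(String[] args) {
        String[] indexed = {
            "FD-A320-REC-SIM-1", "FD-A320-REC-SIM-10", "FD-A320-REC-SIM-11",
            "MIA-FD-A320-REC-SIM-1", "SIN-FD-A320-REC-SIM-1"
        };
        for (String id : indexed) {
            System.out.println(id + " -> phrase starts at "
                + phraseStart(id, "FD-A320-REC-SIM-1"));
        }
    }
}
```

In this model the phrase is found at position 0 in FD-A320-REC-SIM-1 and at position 1 in the MIA- and SIN- ids, while the ids ending in 10 and 11 do not match. That is the same result set Diego reported for the quoted query.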


Cheers








Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Alessandro Benedetti

Hey Jack, reading the doc :

 Set to true if phrase queries will be automatically generated when the
analyzer returns more than one term from whitespace delimited text. NOTE:
this behavior may not be suitable for all languages.

Set to false if phrase queries should only be generated when surrounded by
double quotes.


In this user's case, I guess he is likely to use double quotes.

The only problem he sees so far is that the phrase query uses the
query-time analyzer to split the text into tokens.

First we need feedback from him, but I guess he would like the phrase
query not to tokenize the text within the double quotes.

In that case, we should find a way.
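
One possible first step, sketched here purely as an assumption and not as a solution agreed in this thread, is to separate the quoted parts of the user input from the unquoted parts before any analysis happens. `QuotedSegmentSplitter`, `Segment` and `split` are hypothetical names; the sketch uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits user input into quoted segments (kept verbatim) and unquoted words.
final class QuotedSegmentSplitter {

    static final class Segment {
        final String text;     // segment text, without the surrounding quotes
        final boolean quoted;  // true if the user wrapped it in double quotes

        Segment(String text, boolean quoted) {
            this.text = text;
            this.quoted = quoted;
        }
    }

    // Either a "..." group (captured without the quotes) or a run of non-whitespace.
    private static final Pattern PART = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<Segment> split(String userCriteria) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = PART.matcher(userCriteria);
        while (m.find()) {
            if (m.group(1) != null) {
                segments.add(new Segment(m.group(1), true));
            } else {
                segments.add(new Segment(m.group(2), false));
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        for (Segment s : split("wi-fi \"FD-A320-REC-SIM-1\" jean-*")) {
            System.out.println((s.quoted ? "exact:    " : "analyzed: ") + s.text);
        }
    }
}
```

The unquoted segments could then go through the analyzer as before, while each quoted segment could be matched as a single term against a field indexed without tokenization. Whether such an extra field is acceptable here is an open question.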


Cheers

 




