Re: How to use case-insensitive search
I was assuming this was a Lucene question... The StandardAnalyzer already includes the lower case filter, so queries should be case-insensitive by default. See: https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html

If the question was really how to get case-sensitive queries, simply create your own analyzer without the lower case filter.

-- Jack Krupansky

On Fri, Aug 14, 2015 at 10:07 AM, Erick Erickson erickerick...@gmail.com wrote:
Add LowerCaseFilterFactory to your analysis chain for the fieldType both at query and index time. You'll need to re-index. The admin UI/analysis page will help you understand the effects of each analysis step defined in your fieldTypes. Best, Erick

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
How to use case-insensitive search
Dear Team,

I am trying to build a search engine for fetching person info based on name or email ID. For this I use StandardAnalyzer with wildcard queries. If I enter the query in exactly the stored case, I get the result, but how do I make the search case-insensitive? I mean that searching for rohan or Rohan should give the same result. Currently, if I search as stored in the DB, that is Rohan, I get the result, but not for rohan.

I have posted the same question on Stack Overflow: http://stackoverflow.com/questions/30881355/java-lucene-4-5-how-to-search-by-case-insensitive/30926385#30926385

Please help me out. Is there any reference I can look at?

-- Thanks, Regards, Vardhaman B.N 9945840928
Re: How to use case-insensitive search
Add LowerCaseFilterFactory to your analysis chain for the fieldType both at query and index time. You'll need to re-index. The admin UI/analysis page will help you understand the effects of each analysis step defined in your fieldTypes.

Best, Erick

On Fri, Aug 14, 2015 at 3:44 AM, vardhaman narasagoudar vardhama...@gmail.com wrote:
Dear Team, I am trying to build a search engine for fetching person info based on name or email ID. For this I use StandardAnalyzer with wildcard queries. If I enter the query in exactly the stored case, I get the result, but how do I make the search case-insensitive? I mean that searching for rohan or Rohan should give the same result. Currently, if I search as stored in the DB, that is Rohan, I get the result, but not for rohan. [...]
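Why the same lowercasing has to run at both index and query time (and why a re-index is needed) can be sketched without Lucene. The following is a minimal, stdlib-only toy model, not Lucene or Solr code: the class CaseInsensitiveIndex, its methods, and the sample names used with it are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (NOT Lucene): one analyze() step is shared by indexing and searching.
class CaseInsensitiveIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in for an analysis chain: split on whitespace, then lowercase each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Index time: only the lowercased form of each term is stored.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query time: the query goes through the same normalization before lookup.
    // A document matches if it contains any of the query terms.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyze(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }
}
```

In this model, a document added as "Rohan Sharma" (a made-up name) is found by search("rohan"), search("Rohan") and search("ROHAN") alike, because the stored term and the query term are normalized the same way. Terms that were stored before the lowercasing step existed would still be in mixed case, which is why existing data has to be re-indexed.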
Re: How to use case-insensitive search
Hi,

Wildcard queries don't use the Analyzer, so they are case sensitive. Most of Lucene's query parsers can lowercase terms even when a wildcard is present, but you have to enable this. In most cases it is recommended to use a plain, simple analyzer for fields that are searched with wildcards. If you also have stemming, this will not work correctly with wildcards.

In general, if your queries require wildcards by default, then you should review your analysis! A well-configured analysis chain should allow the user to find stuff without using wildcards!

Uwe

On 14 August 2015 at 16:12:46 CEST, Jack Krupansky jack.krupan...@gmail.com wrote:
I was assuming this was a Lucene question... The StandardAnalyzer already includes the lower case filter, so queries should be case-insensitive by default. See: https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html If the question was really how to get case-sensitive queries, simply create your own analyzer without the lower case filter. -- Jack Krupansky

-- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de
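The point that wildcard patterns bypass analysis can be illustrated with a small stdlib-only sketch. It assumes the indexed terms were already lowercased by the analyzer, and it models the "lowercase the wildcard term" option as a boolean flag. The class WildcardLowercasing, its methods, and the sample terms are hypothetical and are not part of any Lucene query parser.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Toy model (NOT Lucene): wildcard patterns are matched verbatim against stored terms.
class WildcardLowercasing {

    // Translate a simple wildcard pattern ('*' = any run of chars, '?' = one char) to a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // indexedTerms are assumed to be lowercased already (as an analyzer would do at index time).
    // lowercasePattern models a parser option that lowercases wildcard terms before matching.
    static List<String> match(Collection<String> indexedTerms, String wildcard, boolean lowercasePattern) {
        String pattern = lowercasePattern ? wildcard.toLowerCase(Locale.ROOT) : wildcard;
        Pattern regex = toRegex(pattern);
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (regex.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With the terms "rohan", "rohit" and "meera" in the index, match(terms, "Roh*", false) finds nothing because the capital R never occurs in a stored term, while match(terms, "Roh*", true) finds "rohan" and "rohit".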
getting full english word from tokenizing with SmartChineseAnalyzer
Hi,

I am new to Lucene analyzers. I would like to get the full English tokens from SmartChineseAnalyzer, but I’m only getting stems. The following code has the sentence predefined in testStr:

String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";

The printed tokenized result is:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul

As you can see, some long English tokens such as Japanese, position and congratulations are cut short in the tokenization process. I hope I didn't use it wrong.

Test code:

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    private static void testChineseTokenizer() {
        String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
        Analyzer analyzer = new SmartChineseAnalyzer();
        List<String> result = new ArrayList<String>();
        StringReader sr = new StringReader(testStr);
        try {
            TokenStream stream = analyzer.tokenStream(null, sr);
            CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            // collect every token the analyzer emits
            while (stream.incrementToken()) {
                String token = cattr.toString();
                result.add(token);
            }
            stream.end();
            stream.close();
            sr.close();
            analyzer.close();
            // print the tokens separated by spaces
            for (String tok : result) {
                System.out.print(" " + tok);
            }
            System.out.println();
        } catch (IOException e) {
            // not thrown b/c we're using a string reader...
        }
    }
Re: getting full english word from tokenizing with SmartChineseAnalyzer
The easiest thing to do is to create your own analyzer, cut and paste the code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it, and get rid of the line in createComponents(String fieldName, Reader reader) that says

    result = new PorterStemFilter(result);

On Fri, Aug 14, 2015 at 11:20 AM, Wayne Xin wayne_...@hotmail.com wrote:
Hi, I am new to Lucene analyzers. I would like to get the full English tokens from SmartChineseAnalyzer, but I’m only getting stems. [...] As you can see, some long English tokens such as Japanese, position and congratulations are cut short in the tokenization process. I hope I didn't use it wrong. [...]
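The idea of copying an analysis chain and leaving out the stemming step can be sketched in plain Java. This is only a toy model under stated assumptions: AnalysisChainSketch and analyze are hypothetical names, and crudeStem is a deliberately simplistic suffix stripper that stands in for a stemmer. It is not the Porter algorithm and not Lucene's PorterStemFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Toy model (NOT Lucene): an analysis chain as an ordered list of per-token filters.
class AnalysisChainSketch {

    // Hypothetical stand-in for a stemmer: strips a few English suffixes.
    // This is NOT the Porter algorithm; it only mimics the kind of truncation seen in the thread.
    static String crudeStem(String token) {
        for (String suffix : new String[] {"ations", "ion", "e"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Builds the chain and applies it; leaving the stemmer out of the chain
    // is the toy equivalent of deleting the stem-filter line from a copied analyzer.
    static List<String> analyze(List<String> tokens, boolean withStemming) {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(t -> t.toLowerCase(Locale.ROOT));
        if (withStemming) {
            chain.add(AnalysisChainSketch::crudeStem);
        }
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String current = token;
            for (UnaryOperator<String> filter : chain) {
                current = filter.apply(current);
            }
            out.add(current);
        }
        return out;
    }
}
```

Calling analyze with withStemming set to false keeps "japanese", "position" and "congratulations" whole, while the chain with the stand-in stemmer shortens them. Every other step of the chain stays unchanged, which is the property the copy-and-delete approach relies on.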
Re: getting full english word from tokenizing with SmartChineseAnalyzer
Thanks Michael. That works well. Not sure why SmartChineseAnalyzer is final; otherwise we could override createComponents().

New output:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanese player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secure position congratulations

-Wayne

On 8/14/15, 8:48 AM, Michael Mastroianni mmastroia...@placester.com wrote:
The easiest thing to do is to create your own analyzer, cut and paste the code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it, and get rid of the line in createComponents(String fieldName, Reader reader) that says result = new PorterStemFilter(result);
Re: getting full english word from tokenizing with SmartChineseAnalyzer
Thanks Uwe. This seems to be a handy tool. My problem is that I need a better example (a tutorial, maybe) showing which filters are necessary or used by default in a SmartChineseAnalyzer or JapaneseAnalyzer. In this case, I guess I need an HMMChineseTokenizer and a stop filter but not a Porter stem filter. I could give it a try later, but a tutorial would be nice. Thanks for the suggestion though.

-Wayne

On 8/14/15, 4:40 PM, Uwe Schindler u...@thetaphi.de wrote:
Hi, it's much easier to create your own analyzers since Lucene 5.0 (without defining your own classes): https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html Using the builder you can create your own analyzer with just a few lines of code. The names and params used are the factories known from Apache Solr. Analyzers are final by design. Uwe
RE: getting full english word from tokenizing with SmartChineseAnalyzer
Hi,

it's much easier to create your own analyzers since Lucene 5.0 (without defining your own classes): https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html

Using the builder you can create your own analyzer with just a few lines of code. The names and params used are the factories known from Apache Solr.

Analyzers are final by design.

Uwe

- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-Original Message-
From: Wayne Xin [mailto:wayne_...@hotmail.com]
Sent: Friday, August 14, 2015 8:44 PM
To: java-user@lucene.apache.org
Subject: Re: getting full english word from tokenizing with SmartChineseAnalyzer

Thanks Michael. That works well. Not sure why SmartChineseAnalyzer is final; otherwise we could override createComponents(). [...] -Wayne