Re: How to use case-insensitive search

2015-08-14 Thread Jack Krupansky
I was assuming this was a Lucene question...

The StandardAnalyzer already includes the lowercase filter, so queries
should be case-insensitive by default.

See:
https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html

If the question was really how to get case-sensitive queries, simply create
your own analyzer without the lowercase filter.


-- Jack Krupansky

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




How to use case-insensitive search

2015-08-14 Thread vardhaman narasagoudar
Dear Team,

I am trying to build a search engine for fetching person info based on name
or email ID. For this I use the StandardAnalyzer and wildcard queries. If I
enter a case-sensitive query, I get the result, but how do I make the search
case-insensitive?

I mean that a search for rohan or Rohan should return the same result.
Currently, if I search for the value as stored in the DB, that is Rohan, I
get the result, but not for rohan.

I have posted the same question on Stack Overflow:
http://stackoverflow.com/questions/30881355/java-lucene-4-5-how-to-search-by-case-insensitive/30926385#30926385

Please help me out. Is there any reference I can look at?

-- 
Thanks & Regards
Vardhaman B.N
9945840928


Re: How to use case-insensitive search

2015-08-14 Thread Erick Erickson
Add LowerCaseFilterFactory to your analysis chain for the fieldType,
both at query and index time. You'll need to re-index.

The admin UI/analysis page will help you understand the effects
of each analysis step defined in your fieldTypes.

Best,
Erick




Re: How to use case-insensitive search

2015-08-14 Thread Uwe Schindler
Hi,

Wildcard queries don't use the Analyzer, so they are case sensitive. Most of
Lucene's query parsers can lowercase terms even when they contain a
wildcard, but you have to enable this.

In most cases it is recommended to use a plain, simple analyzer for fields
that are searched with wildcards. If the analyzer also applies stemming,
wildcards will not work correctly.

In general, if your queries require wildcards by default, then you should
review your analysis! A well-configured analysis chain should allow the user
to find stuff without using wildcards!

Uwe

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de
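
Uwe's point that wildcard terms bypass the analyzer can be illustrated
without Lucene. The following is a toy model in plain Java, not Lucene code:
the class and method names (WildcardCaseSketch, indexTerms, wildcard) are
hypothetical, the "index" is just a list of strings, and wildcard matching is
emulated with java.util.regex. It assumes the field was indexed with a
lowercasing analyzer and only shows why a pattern typed as Roh* then misses
the stored terms, while the lowercased pattern finds them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class WildcardCaseSketch {

    /** Stand-in for index-time analysis: every name is stored lowercased. */
    static List<String> indexTerms(String... names) {
        List<String> terms = new ArrayList<>();
        for (String name : names) {
            terms.add(name.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    /** Matches a wildcard pattern against the stored terms exactly as typed (no analysis). */
    static List<String> wildcard(List<String> terms, String pattern) {
        // Translate '*' and '?' into their regex equivalents; quote everything else.
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern compiled = Pattern.compile(regex.toString());
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (compiled.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("Rohan", "Rohit");
        // Pattern used as typed: the capital R never matches a lowercased term.
        System.out.println(wildcard(terms, "Roh*"));                            // prints []
        // Pattern lowercased before matching: both terms are found.
        System.out.println(wildcard(terms, "Roh*".toLowerCase(Locale.ROOT)));   // prints [rohan, rohit]
    }
}
```

In a real application the lowercasing step would be done by the query parser
or by the code that builds the wildcard query, using whatever option the
Lucene version in use provides for that.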

getting full english word from tokenizing with SmartChineseAnalyzer

2015-08-14 Thread Wayne Xin
Hi,

I am new to Lucene analyzers. I would like to get the full English tokens
from SmartChineseAnalyzer, but I'm only getting stems. The following code
has predefined the sentence in testStr:

String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马"
    + "林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不"
    + "过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想"
    + "晋级决赛secure position. congratulations.";

The printed tokenized result is:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林
first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池
铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这
关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul

As you can see some long English tokens such as Japanese, position and
congratulations are cut short in the tokenization process. I hope I didn't
use it wrong.

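Test code:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private static void testChineseTokenizer() {
    String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马"
        + "林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不"
        + "过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想"
        + "晋级决赛secure position. congratulations.";
    Analyzer analyzer = new SmartChineseAnalyzer();
    List<String> result = new ArrayList<String>();
    StringReader sr = new StringReader(testStr);

    try {
        TokenStream stream = analyzer.tokenStream(null, sr);
        CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        // Collect every token the analyzer emits.
        while (stream.incrementToken()) {
            String token = cattr.toString();
            result.add(token);
        }
        stream.end();
        stream.close();
        sr.close();
        analyzer.close();
        // Print the tokens separated by spaces.
        for (String tok : result) {
            System.out.print(" " + tok);
        }
        System.out.println();
    } catch (IOException e) {
        // not thrown b/c we're using a string reader...
    }
}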





Re: getting full english word from tokenizing with SmartChineseAnalyzer

2015-08-14 Thread Michael Mastroianni
The easiest thing to do is to create your own analyzer, cut and paste the
code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it,
and get rid of the line in createComponents(String fieldName, Reader
reader)  that says

result = new PorterStemFilter(result);


Re: getting full english word from tokenizing with SmartChineseAnalyzer

2015-08-14 Thread Wayne Xin
Thanks Michael. That works well. Not sure why SmartChineseAnalyzer is
final, otherwise we could override createComponents().

New output:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林
first seed 同 处 1 4 区 3 号
种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉
先 要 过 日本 小将 
japanese player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛
secure position 
congratulations

-Wayne






Re: getting full english word from tokenizing with SmartChineseAnalyzer

2015-08-14 Thread Wayne Xin
Thanks Uwe. This seems to be a handy tool. My problem is that I need a better
example (a tutorial, maybe) that shows which filters a SmartChineseAnalyzer
or JapaneseAnalyzer needs by default. In this case, I guess I need an
HMMChineseTokenizer and a stop filter, but not a Porter stem filter. I could
give it a try later, but a tutorial would be nice. Thanks for the suggestion
though.

-Wayne




RE: getting full english word from tokenizing with SmartChineseAnalyzer

2015-08-14 Thread Uwe Schindler
Hi,

it's much easier to create your own analyzers since Lucene 5.0 (without
defining your own classes):
https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
Using the builder, you can create your own analyzer with just a few lines of
code. The names and parameters are those of the factories known from Apache
Solr.

Analyzers are final by design.

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
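
For the SmartChineseAnalyzer case discussed in this thread, the builder might
be used roughly as follows. This is a sketch under assumptions, not a tested
replacement for SmartChineseAnalyzer: it assumes Lucene 5.x with the
analyzers-common and analyzers-smartcn modules on the classpath, and that the
HMMChineseTokenizer factory is registered under the name "hmmChinese". The
class name NoStemChineseSketch and the field name "body" are made up for
illustration. It leaves out the Porter stem filter and, for brevity, also the
stop filter mentioned above, so punctuation tokens may show up in the output.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NoStemChineseSketch {

    /** Builds an analyzer from the HMM Chinese tokenizer alone, with no stemming. */
    static Analyzer build() throws IOException {
        return CustomAnalyzer.builder()
            .withTokenizer("hmmChinese")   // assumed SPI name of HMMChineseTokenizerFactory
            .build();
    }

    /** Runs the analyzer over the text and collects the emitted terms. */
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                out.add(term.toString());
            }
            stream.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = build()) {
            System.out.println(tokens(analyzer, "日本小将(Japanese player) secure position"));
        }
    }
}
```

Further filters can be chained with addTokenFilter(name, params...) before
build(), using the same factory names and parameters as in a Solr fieldType.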

