Re: getting full english word from tokenizing with SmartChineseAnalyzer
Thanks Uwe. This seems to be a handy tool. My problem is that I need a better example (a tutorial, maybe) showing which filters a SmartChineseAnalyzer or JapaneseAnalyzer needs or applies by default. In this case, I guess I need an HMMChineseTokenizer and a stop filter, but not a Porter stem filter. I could give it a try later, but a tutorial would be nice. Thanks for the suggestion though.

-Wayne

On 8/14/15, 4:40 PM, "Uwe Schindler" wrote:

>it's much easier to create own analyzers since Lucene 5.0 (without
>defining your own classes):
>https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
>Using the builder you can create your own analyzer just with a few lines
>of code. The names and params used are the factories known from Apache
>Solr.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
RE: getting full english word from tokenizing with SmartChineseAnalyzer
Hi,

it's much easier to create your own analyzers since Lucene 5.0 (without defining your own classes):
https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html

Using the builder, you can create your own analyzer with just a few lines of code. The names and params used are those of the factories known from Apache Solr.

Analyzers are final by design.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Wayne Xin [mailto:wayne_...@hotmail.com]
> Sent: Friday, August 14, 2015 8:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: getting full english word from tokenizing with SmartChineseAnalyzer
>
> Thanks Michael. That works well. Not sure why SmartChineseAnalyzer is
> final; otherwise we could override createComponents().
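To make the builder suggestion concrete, here is a minimal sketch of what such an analyzer might look like. It assumes Lucene 5.x with both lucene-analyzers-common and lucene-analyzers-smartcn on the classpath. The factory name "hmmChinese" (for HMMChineseTokenizerFactory) and the stop-word resource path inside the smartcn jar are assumptions to check against your version, and the class and method names are made up for illustration. The chain deliberately contains no Porter stem filter, so English tokens should come out whole.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class; not part of Lucene.
class CustomSmartChineseSketch {

    // Builds an analyzer from Solr-style factory names: the HMM Chinese
    // tokenizer followed by a stop filter, with no stemming step.
    static Analyzer buildAnalyzer() throws IOException {
        return CustomAnalyzer.builder()
                .withTokenizer("hmmChinese") // assumed SPI name of HMMChineseTokenizerFactory
                .addTokenFilter("stop",
                        // assumed classpath location of the smartcn stop-word list
                        "words", "org/apache/lucene/analysis/cn/smart/stopwords.txt")
                .build();
    }

    // Runs the analyzer over the text and collects the emitted terms.
    static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        // "body" is an arbitrary field name used only for this demo.
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = buildAnalyzer()) {
            System.out.println(tokenize(analyzer,
                    "日本小将(Japanese player)secure position. congratulations."));
        }
    }
}
```

If punctuation tokens still show up in the output, the stop-word resource is the first thing to check, since the tokenizer itself does not appear to drop punctuation.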
Re: getting full english word from tokenizing with SmartChineseAnalyzer
Thanks Michael. That works well. Not sure why SmartChineseAnalyzer is final; otherwise we could override createComponents().

New output:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanese player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secure position congratulations

-Wayne

On 8/14/15, 8:48 AM, "Michael Mastroianni" wrote:

>The easiest thing to do is to create your own analyzer, cut and paste the
>code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it,
>and get rid of the line in createComponents(String fieldName, Reader
>reader) that says
>
>result = new PorterStemFilter(result);
Re: getting full english word from tokenizing with SmartChineseAnalyzer
The easiest thing to do is to create your own analyzer, cut and paste the code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it, and get rid of the line in createComponents(String fieldName, Reader reader) that says

result = new PorterStemFilter(result);

On Fri, Aug 14, 2015 at 11:20 AM, Wayne Xin wrote:

> Hi,
>
> I am new to Lucene analyzers. I would like to get the full English tokens
> from SmartChineseAnalyzer, but I’m only getting stems. The following code
> has predefined the sentence in "testStr":
>
> String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
>
> The printed tokenized result is:
>
> 女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul
>
> As you can see, some long English tokens such as Japanese, position and
> congratulations are cut short in the tokenization process. I hope I didn't
> use it wrong.
>
> Test code:
>
> private static void testChineseTokenizer() {
>     String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
>     Analyzer analyzer = new SmartChineseAnalyzer();
>     List<String> result = new ArrayList<String>();
>     StringReader sr = new StringReader(testStr);
>
>     try {
>         TokenStream stream = analyzer.tokenStream(null, sr);
>         CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
>         stream.reset();
>         while (stream.incrementToken()) {
>             String token = cattr.toString();
>             result.add(token);
>         }
>
>         stream.end();
>         stream.close();
>         sr.close();
>         analyzer.close();
>         stream = null;
>         for (String tok : result) {
>             System.out.print(" " + tok);
>         }
>
>         System.out.println();
>     } catch (IOException e) {
>         // not thrown b/c we're using a string reader...
>     }
> }
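For readers who want to see the copy-and-trim approach spelled out, the following is a rough sketch rather than the actual SmartChineseAnalyzer source. It assumes Lucene 5.x, where createComponents takes only the field name (the two-argument form mentioned above is the 4.x signature) and where StopFilter and CharArraySet live in the packages imported below. The class name UnstemmedSmartChineseAnalyzer and the tokenize helper are made up for illustration. The only intended difference from the stock analyzer is that no PorterStemFilter is added.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

// Hypothetical analyzer: HMM Chinese tokenizer plus stop filter, no stemming.
final class UnstemmedSmartChineseAnalyzer extends Analyzer {

    // Reuse the stop-word set (mostly punctuation) that SmartChineseAnalyzer ships with.
    private final CharArraySet stopWords = SmartChineseAnalyzer.getDefaultStopSet();

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new HMMChineseTokenizer();
        TokenStream result = tokenizer;
        // Deliberately no "result = new PorterStemFilter(result);" here.
        if (!stopWords.isEmpty()) {
            result = new StopFilter(result, stopWords);
        }
        return new TokenStreamComponents(tokenizer, result);
    }

    // Convenience helper for trying the analyzer on a piece of text.
    static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new UnstemmedSmartChineseAnalyzer();
             // "body" is an arbitrary field name used only for this demo.
             TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("晋级决赛secure position. congratulations."));
    }
}
```

On Lucene 4.x the same idea should apply, but the createComponents signature and the tokenizer constructor take a Reader, so the sketch would need adjusting.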