Re: getting full english word from tokenizing with SmartChineseAnalyzer

Michael Mastroianni Fri, 14 Aug 2015 08:49:16 -0700

SmartChineseAnalyzer is a final class, so you can't subclass it. The easiest
thing to do is to create your own analyzer, copy the code from
org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it, and delete
the line in createComponents(String fieldName, Reader reader) that says


    result = new PorterStemFilter(result);

That filter is what stems the English tokens (e.g. "position" -> "posit").
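
For illustration, here is a minimal sketch of what such an analyzer could look
like. It is not the SmartChineseAnalyzer source itself. It assumes Lucene 5.0
or later, where createComponents takes only the field name (on 4.x the method
also receives a Reader, which is passed to the tokenizer constructor), and it
assumes lucene-analyzers-smartcn and lucene-analyzers-common are on the
classpath. The class name UnstemmedSmartChineseAnalyzer and the tokens()
helper are made up for this example, and the import path of StopFilter may
differ between Lucene versions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical analyzer: same segmentation as SmartChineseAnalyzer,
// but with no PorterStemFilter in the chain.
final class UnstemmedSmartChineseAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Segments Chinese text into words; Latin-script words come out whole.
        Tokenizer tokenizer = new HMMChineseTokenizer();
        // Reuse the default stop set, which mostly removes punctuation tokens.
        TokenStream result =
            new StopFilter(tokenizer, SmartChineseAnalyzer.getDefaultStopSet());
        // Deliberately no "result = new PorterStemFilter(result);" here.
        return new TokenStreamComponents(tokenizer, result);
    }

    // Hypothetical helper: collects the tokens produced for a string.
    static List<String> tokens(String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (Analyzer analyzer = new UnstemmedSmartChineseAnalyzer();
             TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                out.add(term.toString());
            }
            stream.end();
        }
        return out;
    }
}
```

With this in place, replacing new SmartChineseAnalyzer() in the test code with
new UnstemmedSmartChineseAnalyzer() should leave words such as "position" and
"congratulations" unstemmed.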


On Fri, Aug 14, 2015 at 11:20 AM, Wayne Xin <[email protected]> wrote:

> Hi,
>
> I am new to Lucene analyzers. I would like to get the full English tokens
> from SmartChineseAnalyzer, but I'm only getting stems. The following code
> defines the test sentence in "testStr":
> String testStr = "女单方面，王适娴second seed和头号种子卫冕冠军西班牙选手马
> 林first seed同处1/4区，3号种子李雪芮和韩国选手Korean player成池铉处在2/4区，不
> 过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区，6号种子王仪涵若想
> 晋级决赛secure position. congratulations.";
>
> The printed tokenized result is:
>
> 女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林
> first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池
> 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这
> 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul
>
> As you can see, some longer English tokens such as Japanese, position and
> congratulations are cut short in the tokenization process. I hope I'm not
> using it wrong.
>
> Test code:
>
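> import java.io.IOException;
> import java.io.StringReader;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> private static void testChineseTokenizer() {
>     // Mixed Chinese/English test sentence, split only to fit the line width.
>     String testStr = "女单方面，王适娴second seed和头号种子卫冕冠军西班牙选手马"
>         + "林first seed同处1/4区，3号种子李雪芮和韩国选手Korean player成池铉处在2/4区，不"
>         + "过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区，6号种子王仪涵若想"
>         + "晋级决赛secure position. congratulations.";
>     Analyzer analyzer = new SmartChineseAnalyzer();
>     List<String> result = new ArrayList<String>();
>     StringReader sr = new StringReader(testStr);
>
>     try {
>         TokenStream stream = analyzer.tokenStream(null, sr);
>         CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
>         stream.reset();
>         // Collect each token's text as the stream advances.
>         while (stream.incrementToken()) {
>             result.add(cattr.toString());
>         }
>         stream.end();
>         stream.close();
>         sr.close();
>         analyzer.close();
>
>         for (String tok : result) {
>             System.out.print(" " + tok);
>         }
>         System.out.println();
>     } catch (IOException e) {
>         // not thrown b/c we're using a string reader...
>     }
> }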
>