SmartChineseAnalyzer runs its token stream through a Porter stemmer, which is why your English words come back as stems. The easiest fix is to create your own analyzer: copy the code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it and remove the line in createComponents(String fieldName, Reader reader) that says
result = new PorterStemFilter(result);

On Fri, Aug 14, 2015 at 11:20 AM, Wayne Xin <wayne_...@hotmail.com> wrote:

> Hi,
>
> I am new with Lucene Analyzer. I would like to get the full English tokens
> from SmartChineseAnalyzer. But I’m only getting stems. The following code
> has predefined the sentence in "testStr":
>
> String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
>
> The printed tokenized result is:
>
> 女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林
> first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池
> 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这
> 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul
>
> As you can see some long English tokens such as Japanese, position and
> congratulations are cut short in the tokenization process. I hope I didn't
> use it wrong.
>
> Test code:
>
> private static void testChineseTokenizer() {
>   String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
>   Analyzer analyzer = new SmartChineseAnalyzer();
>   List<String> result = new ArrayList<String>();
>   StringReader sr = new StringReader(testStr);
>
>   try {
>     TokenStream stream = analyzer.tokenStream(null, sr);
>     CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
>     stream.reset();
>     while (stream.incrementToken()) {
>       String token = cattr.toString();
>       result.add(token);
>     }
>
>     stream.end();
>     stream.close();
>     sr.close();
>     analyzer.close();
>     stream = null;
>     for (String tok : result) {
>       System.out.print(" " + tok);
>     }
>
>     System.out.println();
>   } catch (IOException e) {
>     // not thrown b/c we're using a string reader...
>   }
> }
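For reference, here is a rough sketch of what the custom analyzer suggested at the top of this reply might look like. It is not the actual SmartChineseAnalyzer source. I am assuming a Lucene 5.x layout, where createComponents takes only the field name, HMMChineseTokenizer has a no-argument constructor, and StopFilter lives in org.apache.lucene.analysis.core; on 4.x the method also takes a Reader, and in later releases StopFilter moved to another package, so adjust to whatever your version's SmartChineseAnalyzer source shows. The class name UnstemmedSmartChineseAnalyzer is made up for illustration.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.StopFilter; // assumed 5.x package

// Hypothetical name: the same chain as SmartChineseAnalyzer, minus the stemmer.
final class UnstemmedSmartChineseAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Segments Chinese text into words and passes Latin tokens through.
        Tokenizer tokenizer = new HMMChineseTokenizer();
        TokenStream result = tokenizer;

        // The PorterStemFilter line is deliberately left out here,
        // so English tokens are kept whole instead of being stemmed.

        // Keep the default stop set (mostly punctuation), as the
        // no-argument SmartChineseAnalyzer constructor does.
        result = new StopFilter(result, SmartChineseAnalyzer.getDefaultStopSet());

        return new TokenStreamComponents(tokenizer, result);
    }
}
```

Swapping new SmartChineseAnalyzer() for new UnstemmedSmartChineseAnalyzer() in the test code above should be the only change needed on the calling side.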