It looks like my attachment was lost. The test in it uses
org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
I'm inlining it here:
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChineseTokenizerTest {

    public static void main(String[] args) throws IOException {
        // "我"(I) "是"(am) "中国" "人"(Chinese = people of China)
        tokenizeChineseWords("我是中国人");
        // One supplementary character, U+2A6D6, i.e. the surrogate pair
        // (55401 57046). Written as escapes so it survives the mail encoding.
        tokenizeChineseWords("\uD869\uDED6");
    }

    private static void tokenizeChineseWords(String chineseWords)
            throws IOException {
        SmartChineseAnalyzer analyzer =
                new SmartChineseAnalyzer(Version.LUCENE_36);
        TokenStream tokenizer = analyzer.tokenStream(
                null /* field name */, new StringReader(chineseWords));
        // Fetch the term attribute once; it is updated on each incrementToken()
        CharTermAttribute termAttribute =
                tokenizer.addAttribute(CharTermAttribute.class);
        System.out.print("Sentence: ");
        print(chineseWords);
        System.out.println();
        System.out.print("Tokens: [");
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            print(termAttribute);
            System.out.print(" ");
        }
        tokenizer.end();
        tokenizer.close();
        System.out.println("]");
        System.out.println();
    }

    // Prints the text followed by the decimal value of each of its chars
    private static void print(CharSequence text) {
        System.out.print(text);
        System.out.print("(");
        for (int i = 0, length = text.length(); i < length; i++) {
            System.out.print((int) text.charAt(i));
            if (i < length - 1)
                System.out.print(" ");
        }
        System.out.print(")");
    }
}

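
In case it helps, here is a small standalone check, separate from the test
above and using only java.lang.Character, of why I expect a single token for
the second sample: the two char values (55401 57046) form one UTF-16
surrogate pair, that is, a single code point. The class and method names are
only for illustration.

```java
public class SurrogatePairCheck {

    // Combines two UTF-16 code units into one code point, or returns -1
    // if they do not form a valid high/low surrogate pair.
    static int toCodePoint(int first, int second) {
        char high = (char) first;
        char low = (char) second;
        if (!Character.isSurrogatePair(high, low)) {
            return -1;
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int codePoint = toCodePoint(55401, 57046);
        String text = new String(Character.toChars(codePoint));
        // Two chars in the String, but only one code point
        System.out.println("chars: " + text.length());
        System.out.println("code points: "
                + text.codePointCount(0, text.length()));
        System.out.println("code point: U+"
                + Integer.toHexString(codePoint).toUpperCase());
    }
}
```

This prints 2 chars, 1 code point, and U+2A6D6, so a tokenizer that advances
one char at a time would cut this character in half.
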
From: Robert Muir <[email protected]>
To: [email protected],
Date: 01/24/2013 04:31 PM
Subject: Re: Chinese analyzer
On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc
<[email protected]> wrote:
> Note the 2 tokens in the second sample when I would expect to have only one
> token with the (55401 57046) characters.
>
> I could not figure out if I'm doing something wrong, or if this is a bug in
> the Chinese analyzer.
>
Which analyzer specifically? There is more than one...