[ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721497#comment-15721497 ]
peina commented on LUCENE-7509:
-------------------------------
BTW, is there any chance that https://issues.apache.org/jira/browse/LUCENE-7508
will be fixed?
> [smartcn] Some chinese text is not tokenized correctly with Chinese
> punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
> Key: LUCENE-7509
> URL: https://issues.apache.org/jira/browse/LUCENE-7509
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.2.1
> Environment: Mac OS X 10.10
> Reporter: peina
> Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when Chinese punctuation marks
> are appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But
> 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠.
> A similar problem occurs when numbers are appended to the text.
> e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // The class name is illustrative; the original sample gave only the two methods.
> public class SmartcnTokenizationSample {
>     public static void main(String[] args) throws IOException {
>         // The default constructor loads the built-in stopword list.
>         try (Analyzer analyzer = new SmartChineseAnalyzer()) {
>             System.out.println("Sample1=======");
>             printTokens(analyzer, "生活报8月4号");
>             printTokens(analyzer, "生活报");
>             System.out.println("Sample2=======");
>             printTokens(analyzer, "碧绿的眼珠,");
>             printTokens(analyzer, "碧绿的眼珠");
>         }
>     }
>
>     // Prints the sentence, then each token on its own line.
>     private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>         System.out.println("sentence:" + sentence);
>         try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
>             CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>             tokens.reset();
>             while (tokens.incrementToken()) {
>                 System.out.println(termAttr.toString());
>             }
>             tokens.end(); // complete the stream before it is closed
>         }
>     }
> }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠
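
The report above boils down to one expectation: appending punctuation to a sentence should not change how the rest of it is segmented. A minimal sketch of that check, using only the JDK, might look like the following. The names `SegmentationStabilityCheck`, `isStable`, and `reportedTokens` are hypothetical and not part of Lucene, and the stub tokenizer simply replays the token lists quoted in the report instead of calling SmartChineseAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration only; nothing here is a Lucene API.
class SegmentationStabilityCheck {

    /** True if the code point belongs to one of the Unicode punctuation categories. */
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Removes any run of punctuation characters from the end of the text. */
    static String stripTrailingPunctuation(String text) {
        int end = text.length();
        while (end > 0) {
            int codePoint = text.codePointBefore(end);
            if (!isPunctuation(codePoint)) {
                break;
            }
            end -= Character.charCount(codePoint);
        }
        return text.substring(0, end);
    }

    /**
     * Returns true if the sentence yields the same tokens with and without
     * its trailing punctuation. Tokens made up only of punctuation are ignored.
     */
    static boolean isStable(Function<String, List<String>> tokenizer, String sentence) {
        List<String> withPunctuation = new ArrayList<>(tokenizer.apply(sentence));
        withPunctuation.removeIf(token -> stripTrailingPunctuation(token).isEmpty());
        List<String> bare = tokenizer.apply(stripTrailingPunctuation(sentence));
        return withPunctuation.equals(bare);
    }

    /** Stub tokenizer (assumption): replays the token lists quoted in the report. */
    static List<String> reportedTokens(String sentence) {
        if (sentence.equals("碧绿的眼珠")) {
            return Arrays.asList("碧绿", "的", "眼珠");
        }
        if (sentence.equals("碧绿的眼珠,")) {
            return Arrays.asList("碧绿", "的", "眼", "珠");
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // With the reported output the check fails, so this prints "false".
        System.out.println(isStable(SegmentationStabilityCheck::reportedTokens, "碧绿的眼珠,"));
    }
}
```

To run the same check against the real analyzer, the stub could be replaced by a function that collects the terms produced by `analyzer.tokenStream(...)` into a list.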
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)