[ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714052#comment-15714052 ]

Chang KaiShin commented on LUCENE-7509:
---------------------------------------

This is not a bug. The underlying Viterbi algorithm that segments Chinese 
sentences is driven by the occurrence probabilities of the Chinese 
characters. Take the sentence "生活报8月4号" as an example. The character "报" 
carries two meanings: at the end of a phrase it means "newspaper" (as in a 
daily newspaper), but in conjunction with other characters it means "to 
report" something. When more characters follow, the algorithm therefore 
segments "报" as an independent word in the "report" sense; by contrast, 
"生活报" on its own is judged more likely to mean the daily newspaper. You 
need to add the relevant words to the dictionary so the algorithm can learn 
them and produce the result you want.
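The probability-driven segmentation described above can be sketched as a small Viterbi-style dynamic program over split points. This is a minimal illustration only, not the smartcn implementation: the dictionary words, log-probabilities, unknown-character penalty, and maximum word length are all made-up assumptions chosen so that "生活报" outscores "生活" + "报".

```java
import java.util.*;

public class ViterbiSegmenter {
    // Toy unigram log-probabilities; the values here are invented for illustration.
    private final Map<String, Double> logProb;
    private static final double UNKNOWN_CHAR_PENALTY = -20.0; // fallback for single chars
    private static final int MAX_WORD_LEN = 4;                 // assumed max word length

    public ViterbiSegmenter(Map<String, Double> dict) {
        this.logProb = new HashMap<>(dict);
    }

    // best[i] = log-probability of the best segmentation of s.substring(0, i);
    // back[i] = start index of the last word in that best segmentation.
    public List<String> segment(String s) {
        int n = s.length();
        double[] best = new double[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = Math.max(0, i - MAX_WORD_LEN); j < i; j++) {
                String w = s.substring(j, i);
                Double lp = logProb.get(w);
                if (lp == null) {
                    if (w.length() > 1) continue;      // only known multi-char words allowed
                    lp = UNKNOWN_CHAR_PENALTY;         // unknown single char: heavy penalty
                }
                if (best[j] + lp > best[i]) {
                    best[i] = best[j] + lp;
                    back[i] = j;
                }
            }
        }
        LinkedList<String> words = new LinkedList<>();
        for (int i = n; i > 0; i = back[i]) {
            words.addFirst(s.substring(back[i], i));
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Double> dict = new HashMap<>();
        dict.put("生活", -3.0);
        dict.put("生活报", -6.0);
        dict.put("报", -4.0);
        // "生活" + "报" scores -3.0 + -4.0 = -7.0, so the whole word "生活报" (-6.0) wins.
        ViterbiSegmenter seg = new ViterbiSegmenter(dict);
        System.out.println(String.join("|", seg.segment("生活报")));
    }
}
```

In this toy model, adding a word to the dictionary with a sufficiently high probability is exactly what tips the dynamic program toward keeping it as one token, which is the mechanism the comment above appeals to.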

> [smartcn] Some Chinese text is not tokenized correctly with Chinese 
> punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when Chinese punctuation marks 
> are appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠.
> A similar case happens when the text has numbers appended.
> e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> public static void main(String[] args) throws IOException {
>     Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
>     System.out.println("Sample1=======");
>     String sentence = "生活报8月4号";
>     printTokens(analyzer, sentence);
>     sentence = "生活报";
>     printTokens(analyzer, sentence);
>     System.out.println("Sample2=======");
>     sentence = "碧绿的眼珠,";
>     printTokens(analyzer, sentence);
>     sentence = "碧绿的眼珠";
>     printTokens(analyzer, sentence);
>     analyzer.close();
> }
>
> private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
>     tokens.reset();
>     CharTermAttribute termAttr = tokens.getAttribute(CharTermAttribute.class);
>     while (tokens.incrementToken()) {
>         System.out.println(termAttr.toString());
>     }
>     tokens.end();
>     tokens.close();
> }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
