[jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

2016-12-04 Thread peina (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721497#comment-15721497
 ] 

peina commented on LUCENE-7509:
---

BTW, is there any chance that https://issues.apache.org/jira/browse/LUCENE-7508 
will be fixed?

> [smartcn] Some chinese text is not tokenized correctly with Chinese 
> punctuation marks appended
> --
>
> Key: LUCENE-7509
> URL: https://issues.apache.org/jira/browse/LUCENE-7509
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.2.1
> Environment: Mac OS X 10.10
> Reporter: peina
> Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when a Chinese punctuation mark 
> is appended.
> e.g.
> 碧绿的眼珠 is tokenized correctly as 碧绿|的|眼珠.
> But 
> 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠,
> A similar problem occurs when numbers are appended to the text.
> e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // The class name is arbitrary; it only wraps the two methods below.
> public class TokenizeSample {
>   public static void main(String[] args) throws IOException {
>     // The no-arg constructor loads the default stopwords.
>     try (Analyzer analyzer = new SmartChineseAnalyzer()) {
>       System.out.println("Sample1===");
>       printTokens(analyzer, "生活报8月4号");
>       printTokens(analyzer, "生活报");
>       System.out.println("Sample2===");
>       printTokens(analyzer, "碧绿的眼珠,");
>       printTokens(analyzer, "碧绿的眼珠");
>     }
>   }
>
>   // Prints every token the analyzer emits for the sentence, one per line.
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
>       CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>       tokens.reset();
>       while (tokens.incrementToken()) {
>         System.out.println(termAttr.toString());
>       }
>       tokens.end(); // finish the stream before it is closed
>     }
>   }
> }
> Output:
> Sample1===
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2===
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

2016-12-04 Thread peina (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721489#comment-15721489
 ] 

peina commented on LUCENE-7509:
---

Thanks. Makes sense to me.




[jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

2016-12-01 Thread Chang KaiShin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714052#comment-15714052
 ] 

Chang KaiShin commented on LUCENE-7509:
---

This is not a bug. The underlying Viterbi algorithm that segments Chinese 
sentences is based on the probability of occurrence of the Chinese 
characters. Take the sentence "生活报8月4号" as an example. The character "报" 
has two meanings here. Placed at the end of the sentence, it means a daily 
newspaper. Placed in conjunction with other Chinese characters, however, it 
means to report something. So the algorithm segments "报" as an independent 
word meaning "to report". In contrast, "生活报" on its own is assumed to have 
a higher chance of meaning a daily newspaper. You need to add words to the 
dictionary so that the algorithm can learn them and produce the result you want.
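
To make the point about probabilities and dictionary entries concrete, here is 
a toy sketch of maximum-probability segmentation by dynamic programming. It is 
not the smartcn implementation: the class name ToySegmenter, the unigram 
scoring, and every probability below are assumptions made up for illustration. 
It only shows how adding one dictionary entry can change the chosen 
segmentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy segmenter; NOT the smartcn implementation.
public class ToySegmenter {
  // Penalty for a single character that is missing from the dictionary (made-up value).
  private static final double UNKNOWN_CHAR_LOG_PROB = Math.log(1e-8);
  // Longest dictionary word considered, in UTF-16 chars (assumes BMP-only text).
  private static final int MAX_WORD_LEN = 4;

  private final Map<String, Double> logProb = new HashMap<>();

  public void addWord(String word, double probability) {
    logProb.put(word, Math.log(probability));
  }

  // Returns the segmentation whose words have the highest summed log-probability.
  public List<String> segment(String text) {
    int n = text.length();
    double[] best = new double[n + 1]; // best[i]: best score of text[0, i)
    int[] prev = new int[n + 1];       // prev[i]: start of the last word ending at i
    Arrays.fill(best, Double.NEGATIVE_INFINITY);
    best[0] = 0.0;
    for (int end = 1; end <= n; end++) {
      for (int start = Math.max(0, end - MAX_WORD_LEN); start < end; start++) {
        String candidate = text.substring(start, end);
        Double lp = logProb.get(candidate);
        if (lp == null) {
          if (candidate.length() > 1) {
            continue; // unknown multi-character strings are not words
          }
          lp = UNKNOWN_CHAR_LOG_PROB; // unknown single characters stay possible
        }
        double score = best[start] + lp;
        if (score > best[end]) {
          best[end] = score;
          prev[end] = start;
        }
      }
    }
    // Walk the back-pointers from the end of the text to recover the words.
    List<String> words = new ArrayList<>();
    for (int end = n; end > 0; end = prev[end]) {
      words.add(text.substring(prev[end], end));
    }
    Collections.reverse(words);
    return words;
  }

  public static void main(String[] args) {
    ToySegmenter segmenter = new ToySegmenter();
    segmenter.addWord("碧绿", 1e-3);
    segmenter.addWord("的", 5e-2);
    segmenter.addWord("眼", 1e-2);
    segmenter.addWord("珠", 1e-2);
    System.out.println(segmenter.segment("碧绿的眼珠")); // [碧绿, 的, 眼, 珠]
    segmenter.addWord("眼珠", 1e-3); // teach the dictionary the compound word
    System.out.println(segmenter.segment("碧绿的眼珠")); // [碧绿, 的, 眼珠]
  }
}
```

In this sketch the compound wins once it is in the dictionary because a single 
entry with probability 1e-3 scores higher than two entries with probability 
1e-2 each.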




[jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

2016-10-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595364#comment-15595364
 ] 

Michael McCandless commented on LUCENE-7509:


Hi [~peina], could you please turn your test fragments into a test that fails?  
See e.g. https://wiki.apache.org/lucene-java/HowToContribute

Do you know how to fix this?  Is there a Unicode API we should be using to more 
generally check for punctuation, so that Chinese punctuation is included?
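
On the Unicode question, one candidate is the general-category lookup in 
java.lang.Character, which covers full-width and CJK punctuation as well as 
ASCII. The sketch below is only an illustration of that JDK API, not a patch: 
the class name PunctuationCheck and the sample code points are assumptions, 
and whether smartcn should classify characters this way is left open.

```java
// Hypothetical helper; shows the JDK API only, not smartcn code.
public class PunctuationCheck {

  // True if the code point is in any Unicode punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po).
  public static boolean isPunctuation(int codePoint) {
    switch (Character.getType(codePoint)) {
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    // ASCII comma, FULLWIDTH COMMA, IDEOGRAPHIC FULL STOP,
    // LEFT DOUBLE ANGLE BRACKET, and the ideograph U+773C as a non-punctuation control.
    int[] samples = {',', 0xFF0C, 0x3002, 0x300A, 0x773C};
    for (int codePoint : samples) {
      System.out.println(String.format("U+%04X punctuation=%b", codePoint, isPunctuation(codePoint)));
    }
  }
}
```

The same categories are also reachable from regular expressions through 
java.util.regex.Pattern with the \p{P} property class.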
