chengpohi created LUCENE-8325:
---------------------------------
Summary: smartcn analyzer can't handle SURROGATE char
Key: LUCENE-8325
URL: https://issues.apache.org/jira/browse/LUCENE-8325
Project: Lucene - Core
Issue Type: Bug
Reporter: chengpohi
Attachments: handle-surrogate-char-for-smartcn.patch
This issue is from [smartcn_tokenizer ...](https://github.com/elastic/elasticsearch/issues/30739)
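The smartcn analyzer can't handle surrogate characters, i.e. characters outside the Basic Multilingual Plane that Java encodes as a surrogate pair of two chars. Example: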
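{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer ca = new SmartChineseAnalyzer();
String sentence = "\uD862\uDE0F"; // 𨨏, a single code point (U+28A0F) encoded as a surrogate pair
TokenStream tokenStream = ca.tokenStream("", sentence);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
// Print every term the analyzer emits for the input.
while (tokenStream.incrementToken()) {
  String term = charTermAttribute.toString();
  System.out.println(term);
}
tokenStream.end();
tokenStream.close();
{code}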
The above code snippet outputs two tokens, each printed as "?", instead of the single input character:
{code:java}
?
?
{code}
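One plausible reading of this output is that the input is being split one UTF-16 char at a time, so each half of the surrogate pair becomes its own token. That is an assumption: this report does not say where in smartcn the split happens. The following is a minimal, Lucene-independent sketch of the difference between iterating by char and by code point; the class and method names are hypothetical and are not part of the Lucene API or of the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Lucene or of the attached patch.
public class SurrogateSplitDemo {

    // Splits text one UTF-16 code unit at a time; a surrogate pair is torn in two.
    static List<String> splitByChar(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(String.valueOf(text.charAt(i)));
        }
        return out;
    }

    // Splits text one Unicode code point at a time; a surrogate pair stays together.
    static List<String> splitByCodePoint(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // advance by 2 for supplementary code points
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // same input as in the report
        System.out.println(splitByChar(sentence).size());      // prints 2
        System.out.println(splitByCodePoint(sentence).size()); // prints 1
    }
}
```

Each element returned by splitByChar here is an unpaired surrogate, which is not a valid character on its own and is commonly rendered as "?" when printed.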