[ https://issues.apache.org/jira/browse/LUCENE-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chengpohi updated LUCENE-8325:
------------------------------
    Attachment:     (was: handle-surrogate-char-for-smartcn.patch)

> smartcn analyzer can't handle SURROGATE char
> --------------------------------------------
>
>                 Key: LUCENE-8325
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8325
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: chengpohi
>            Priority: Minor
>              Labels: newbie, patch
>
> This issue was originally reported at [https://github.com/elastic/elasticsearch/issues/30739].
> The smartcn analyzer can't handle surrogate characters. Example:
>
> {code:java}
> Analyzer ca = new SmartChineseAnalyzer();
> String sentence = "\uD862\uDE0F"; // 𨨏, a supplementary character encoded as a surrogate pair
> TokenStream tokenStream = ca.tokenStream("", sentence);
> CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while (tokenStream.incrementToken()) {
>     String term = charTermAttribute.toString();
>     System.out.println(term);
> }
> {code}
>
> The code snippet above outputs:
>
> {code:java}
> ?
> ?
> {code}
>
> I have created a *PATCH* that tries to fix this; please help review it. Since
> *smartcn* only supports *GBK* characters, the patch simply handles a surrogate
> pair as a *single char*.
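
For readers unfamiliar with surrogate pairs, here is a minimal plain-Java sketch of why two `?` terms can appear. It is not the attached patch and does not use any smartcn code; the class `SurrogateSplit` and its two methods are hypothetical names introduced only for illustration. It shows that walking a string one `char` at a time splits the pair into two invalid halves, while walking by code point keeps it as one unit.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateSplit {

    // Splits text into one string per UTF-16 code unit.
    // A supplementary character is broken into two lone surrogates.
    static List<String> splitByChar(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(String.valueOf(text.charAt(i)));
        }
        return out;
    }

    // Splits text into one string per Unicode code point.
    // A surrogate pair stays together as a single element.
    static List<String> splitByCodePoint(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // advances by 2 for a surrogate pair
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // same input as in the report above
        System.out.println(splitByChar(sentence).size());      // prints 2
        System.out.println(splitByCodePoint(sentence).size()); // prints 1
    }
}
```

Each lone surrogate from the `char`-based split is not a valid character on its own, so it typically renders as `?` when printed, which would be consistent with the output shown in the report.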