[
https://issues.apache.org/jira/browse/LUCENE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640279#comment-16640279
]
Steve Rowe commented on LUCENE-8526:
------------------------------------
bq. We can maybe add a note in the CJKBigram filter regarding this behavior
when the StandardTokenizer is used?
+1
How's this, to be added to the CJKBigramFilter class javadoc:
{noformat}
Unlike ICUTokenizer, StandardTokenizer does not split at script boundaries.
Korean Hangul characters are treated the same as many other scripts' letters,
so StandardTokenizer can produce tokens that mix Hangul and non-Hangul
characters, e.g. "한국abc". Such mixed-script tokens are typed as
<code>&lt;ALPHANUM&gt;</code> rather than <code>&lt;HANGUL&gt;</code>, and as a
result will not be converted to bigrams by CJKBigramFilter.
{noformat}
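
To make the consequence concrete, here is a minimal sketch of that pipeline. The class names are real Lucene analysis classes; the expected output noted in the comments is an assumption based on this issue's description, not verified against a specific release:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class MixedHangulBigramDemo {
  public static void main(String[] args) throws Exception {
    // StandardTokenizer keeps the mixed-script run together as one token.
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("한국abc"));
    TokenStream stream = new CJKBigramFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    TypeAttribute type = stream.addAttribute(TypeAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      // Expected (per this issue): a single line "한국abc -> <ALPHANUM>".
      // CJKBigramFilter passes the token through untouched because it only
      // bigrams tokens typed <IDEOGRAPHIC>, <HIRAGANA>, <KATAKANA>, or <HANGUL>.
      System.out.println(term + " -> " + type.type());
    }
    stream.end();
    stream.close();
  }
}
{code}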
> StandardTokenizer doesn't separate hangul characters from other non-CJK chars
> -----------------------------------------------------------------------------
>
> Key: LUCENE-8526
> URL: https://issues.apache.org/jira/browse/LUCENE-8526
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
>
> It was first reported here:
> https://github.com/elastic/elasticsearch/issues/34285.
> I don't know if it's the expected behavior, but the StandardTokenizer does not
> split words that are composed of a mix of non-CJK characters and Hangul
> syllables. For instance, "한국2018" or "한국abc" is kept as-is by this tokenizer
> and marked as an alphanumeric group. This breaks the CJKBigram token filter,
> which will not build bigrams on such groups. Other CJK characters are
> correctly split when they are mixed with other alphabets, so I'd expect the
> same for Hangul.
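
For comparison, here is a similar sketch contrasting the two tokenizers on the same mixed-script input. ICUTokenizer lives in the lucene-analyzers-icu module; the expected output in the comments again reflects the behavior described in this issue rather than a verified run:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ScriptBoundaryDemo {
  public static void main(String[] args) throws Exception {
    // Expected: StandardTokenizer keeps the run together: 한국abc -> <ALPHANUM>
    dump(new StandardTokenizer(), "한국abc");
    // Expected: ICUTokenizer splits at the script boundary:
    //   한국 -> <HANGUL>, abc -> <ALPHANUM>
    dump(new ICUTokenizer(), "한국abc");
  }

  static void dump(Tokenizer tokenizer, String text) throws Exception {
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term + " -> " + type.type());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}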