[ https://issues.apache.org/jira/browse/LUCENE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640228#comment-16640228 ]

Steve Rowe commented on LUCENE-8526:
------------------------------------

Hangul syllables' [UAX#29|https://www.unicode.org/reports/tr29/] word-break 
category is ALetter (see e.g. [the properties for 
U+AC00|https://unicode.org/cldr/utility/character.jsp?a=AC00]).  The word-break 
rules in UAX#29 don't have any special handling for Hangul syllables.
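
For reference, here's a minimal ICU4J sketch (untested, and it assumes ICU4J is 
on the classpath) that queries the UAX#29 word-break property of U+AC00 
directly:

{code:java}
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class HangulWordBreakCheck {
  public static void main(String[] args) {
    // U+AC00 is the first Hangul syllable ("가")
    int wb = UCharacter.getIntPropertyValue(0xAC00, UProperty.WORD_BREAK);
    // Should print "ALetter": Hangul syllables share the word-break class of
    // Latin letters, so the UAX#29 rules keep them glued to adjacent letters/digits
    System.out.println(UCharacter.getPropertyValueName(
        UProperty.WORD_BREAK, wb, UProperty.NameChoice.LONG));
  }
}
{code}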

bq. The other CJK characters are correctly split when they are mixed with 
other alphabets, so I'd expect the same for Hangul.

UAX#29 word break rules include no provisions for breaking at script 
boundaries.  By contrast, ICUTokenizer does include this functionality.  From 
[https://lucene.apache.org/core/7_5_0/analyzers-icu/org/apache/lucene/analysis/icu/segmentation/ICUTokenizer.html]:

bq. Words are broken across script boundaries, then segmented according to the 
BreakIterator and typing provided by the ICUTokenizerConfig
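
As an illustration (a rough, untested sketch against the lucene-analyzers-icu 
module), ICUTokenizer with its default config splits "한국abc" at the 
Hangul/Latin script boundary:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IcuScriptSplitSketch {
  public static void main(String[] args) throws Exception {
    // The no-arg constructor uses the default ICUTokenizerConfig
    ICUTokenizer tokenizer = new ICUTokenizer();
    tokenizer.setReader(new StringReader("한국abc"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // expected: "한국", then "abc"
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}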

The current StandardTokenizer implements the Unicode 6.3 word-break rules, 
which are pretty old.  Lucene should update to the recently released JFlex 1.7, 
which supports Unicode 9.0. (I'll go make an issue.)  But I checked, and Unicode 
11.0's word-break rules still do not include any script-boundary splitting.
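
For comparison, the same loop over StandardTokenizer (again an untested sketch) 
illustrates the behavior reported below: the mixed Hangul/ASCII input comes 
back as a single ALPHANUM token:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardNoSplitSketch {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("한국2018"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // expected: the single token "한국2018"
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}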

> StandardTokenizer doesn't separate hangul characters from other non-CJK chars
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8526
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> It was first reported here 
> https://github.com/elastic/elasticsearch/issues/34285.
> I don't know if it's the expected behavior, but the StandardTokenizer does 
> not split words which are composed of a mix of non-CJK characters and Hangul 
> syllables. For instance, "한국2018" or "한국abc" is kept as is by this 
> tokenizer and marked as an alpha-numeric group. This breaks the CJKBigram 
> token filter, which will not build bigrams on such groups. The other CJK 
> characters are correctly split when they are mixed with other alphabets, so 
> I'd expect the same for Hangul.


