[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

Jim Ferenczi (Jira) Mon, 09 Sep 2019 03:54:48 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925588#comment-16925588
 ]


Jim Ferenczi commented on LUCENE-8966:
--------------------------------------

I don't think it's a bug, [~danmuzi], or at least not one related to this
issue. In your example, splitting off the first dot ('.' is a word in the
dictionary) is considered a better path than grouping all the dots eagerly.
We process unknown words greedily, so we compare the path
"[4], [.], [.....]" with "[4], [.], [.], [....]", "[4], [.], [.], [.], [...]",
... "[4], [......]". Keeping the first dot separate from the rest indicates
that, in our model, a number followed by a dot is a better splitting path than
a run of multiple dots. We can discuss this behavior in a new issue if you
think it should be configurable (for instance, the JapaneseTokenizer processes
unknown words greedily only in search mode).
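
To make the comparison above concrete, here is a minimal sketch in Java. It is
not the KoreanTokenizer code: the class name, the method names, and every cost
value are hypothetical, chosen only so that the path with one separate dot
comes out cheapest. The real tokenizer scores paths with word and connection
costs from its dictionary; this sketch simply enumerates the candidates listed
above and keeps the cheapest.

```java
import java.util.ArrayList;
import java.util.List;

public class DotPathSketch {
    // Hypothetical costs for illustration only; NOT taken from the Korean dictionary.
    static final int DICT_DOT = 100;         // word cost of '.' as a dictionary entry
    static final int UNKNOWN_GROUP = 250;    // word cost of a run of dots kept as one unknown word
    static final int NUMBER_TO_DOT = -200;   // connection cost: number -> dictionary dot
    static final int NUMBER_TO_GROUP = 50;   // connection cost: number -> unknown group
    static final int DOT_TO_DOT = 150;       // connection cost: dictionary dot -> dictionary dot
    static final int DOT_TO_GROUP = 20;      // connection cost: dictionary dot -> unknown group

    /** Tokens for "4" followed by `singles` separate dots and one group holding the remaining dots. */
    static List<String> candidate(int dots, int singles) {
        List<String> tokens = new ArrayList<>();
        tokens.add("4");
        for (int i = 0; i < singles; i++) {
            tokens.add(".");
        }
        StringBuilder group = new StringBuilder();
        for (int i = singles; i < dots; i++) {
            group.append('.');
        }
        tokens.add(group.toString());
        return tokens;
    }

    /** Total cost of the candidate with `singles` separate dots before the final group. */
    static int cost(int singles) {
        int total = 0;
        for (int i = 0; i < singles; i++) {
            total += (i == 0 ? NUMBER_TO_DOT : DOT_TO_DOT) + DICT_DOT;
        }
        total += (singles == 0 ? NUMBER_TO_GROUP : DOT_TO_GROUP) + UNKNOWN_GROUP;
        return total;
    }

    /** Number of separate dots in the cheapest candidate; the final group is never empty. */
    static int bestSingles(int dots) {
        int best = 0;
        for (int s = 1; s < dots; s++) {
            if (cost(s) < cost(best)) {
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int best = bestSingles(6);
        System.out.println(candidate(6, best) + " cost=" + cost(best));
    }
}
```

With these made-up numbers, "[4], [......]" costs 300, "[4], [.], [.....]"
costs 170, and every further single dot adds 250, so the path with one separate
dot wins. Different costs would give a different winner, which is why the
behavior depends on the model rather than on the digit-splitting change.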

> KoreanTokenizer should split unknown words on digits
> ----------------------------------------------------
>
>                 Key: LUCENE-8966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8966
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer
> groups characters of unknown words if they belong to the same script or an
> inherited one. This is fine for inputs like Мoscow (with a Cyrillic М and the
> rest in Latin), but the rule doesn't work well on digits since they are
> considered common with other scripts. For instance, the input "44사이즈" is
> kept as is even though "사이즈" is part of the dictionary. We should restore
> the original behavior and split any unknown word where a digit is followed by
> a character of another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]
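
The rule proposed in the description can be sketched as follows. This is only
an illustration of "split where a digit is followed by another type", not the
attached patch: the class and method names are hypothetical, and the sketch
looks at UTF-16 chars rather than code points, so it ignores supplementary
characters.

```java
import java.util.ArrayList;
import java.util.List;

public class DigitSplitSketch {
    /** Splits text wherever a digit is immediately followed by a non-digit character. */
    static List<String> splitAfterDigits(String text) {
        List<String> parts = new ArrayList<>();
        if (text.isEmpty()) {
            return parts;
        }
        int start = 0;
        for (int i = 1; i < text.length(); i++) {
            boolean prevIsDigit = Character.isDigit(text.charAt(i - 1));
            boolean currIsDigit = Character.isDigit(text.charAt(i));
            if (prevIsDigit && !currIsDigit) {
                parts.add(text.substring(start, i));
                start = i;
            }
        }
        parts.add(text.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // "44" followed by the Hangul word from the description, written with
        // Unicode escapes so the source file stays ASCII.
        String input = "44\uC0AC\uC774\uC988";
        System.out.println(splitAfterDigits(input));
    }
}
```

For the input from the description, the sketch returns two parts, "44" and
"사이즈", so the second part could then be looked up in the dictionary.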



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
