[ https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925588#comment-16925588 ]
Jim Ferenczi commented on LUCENE-8966:
--------------------------------------

I don't think it's a bug [~danmuzi], or at least I don't think it's related to this issue. In your example, keeping the first dot separate ('.' is a word in the dictionary) is considered a better path than grouping all the dots eagerly. We process unknown words greedily, so we compare the path "[4], [.], [.....]" with "[4], [.], [.], [....]", "[4], [.], [.], [.], [...]", ... "[4], [......]". Keeping the first dot separated from the rest indicates that, in our model, a number followed by a dot is a better splitting path than multiple dots. We can discuss this behavior in a new issue if you think it should be configurable (for instance, the JapaneseTokenizer processes unknown words greedily only in search mode).

> KoreanTokenizer should split unknown words on digits
> ----------------------------------------------------
>
>                 Key: LUCENE-8966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8966
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer
> groups characters of unknown words if they belong to the same script or an
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the
> rest in Latin) but this rule doesn't work well on digits since they are
> considered common with other scripts. For instance the input "44사이즈" is kept
> as is even though "사이즈" is part of the dictionary. We should restore the
> original behavior and split any unknown word if a digit is followed by
> another type.
> This issue was first discovered in
> [https://github.com/elastic/elasticsearch/issues/46365]
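
The splitting rule proposed in the issue description (break an unknown word when a digit is followed by a character of another type) can be illustrated with a small standalone sketch. This is not the code from the attached patch or from KoreanTokenizer: the class and method names below are hypothetical, and the sketch only looks at digit boundaries. It ignores scripts, the dictionary, and the path costs discussed in the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class DigitSplitSketch {

    // Splits the input wherever a digit is immediately followed by a non-digit.
    // Assumes the input contains only BMP characters (true for Hangul syllables).
    static List<String> splitOnDigitBoundary(String text) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < text.length(); i++) {
            boolean prevIsDigit = Character.isDigit(text.charAt(i - 1));
            boolean currIsDigit = Character.isDigit(text.charAt(i));
            if (prevIsDigit && !currIsDigit) {
                parts.add(text.substring(start, i));
                start = i;
            }
        }
        // Add the trailing segment, if any.
        if (start < text.length()) {
            parts.add(text.substring(start));
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [44, 사이즈]
        System.out.println(splitOnDigitBoundary("44사이즈"));
    }
}
```

With this rule, "44사이즈" is no longer kept as a single unknown word, so the "사이즈" part becomes available for a dictionary lookup on its own.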