[jira] [Commented] (LUCENE-8772) [nori] A word that is registered in advance, but the words are not separated and recognized as 'UNKNOWN'
[ https://issues.apache.org/jira/browse/LUCENE-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822831#comment-16822831 ]

YOO JEONGIN commented on LUCENE-8772:
-------------------------------------

Hello [~jim.ferenczi],

Thank you for the reply. Even if the cost increases, I think the behavior should be changed so that words registered in advance are recognized. I know which part needs to be fixed, but I do not know how to fix it. Could you share the code for the change?

> [nori] A word that is registered in advance, but the words are not separated and recognized as 'UNKNOWN'
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8772
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8772
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.5, 7.6, 7.7, 7.7.1, 8.0
>            Reporter: YOO JEONGIN
>            Priority: Major
>         Attachments: image-2019-04-19-11-32-56-310.png
>
>
> Hello,
> With 'nori', if the input does not begin with a registered word, the whole input is analyzed as 'UNKNOWN', even when a registered word appears in the middle.
> So here is the question:
> Does nori analyze only from the left side, and not from the right side?
> Could this be solved?
>
> ex)
> input => 갊수학
> Condition
> registered in the dictionary : 수학
> not registered in the dictionary : 갊
> result => 갊수학
> !image-2019-04-19-11-32-56-310.png!

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8772) [nori] A word that is registered in advance, but the words are not separated and recognized as 'UNKNOWN'
[ https://issues.apache.org/jira/browse/LUCENE-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821810#comment-16821810 ]

Jim Ferenczi commented on LUCENE-8772:
--------------------------------------

That's expected: the unknown-word heuristic groups consecutive characters of the same class together. In this case `갊수학` is treated as a single word, and because `갊` is unknown we jump to the end of that unknown word before looking for new entries.

You can add `갊` to the user dictionary, or add a special rule `갊수학 갊 수학` that decomposes the term.

We could also change the heuristic to emit unknown words of length 1, so that user words inside unknown blocks can be detected, but I wonder whether the cost of doing that would be prohibitive.
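The behavior described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not nori's implementation: the real tokenizer scores candidate paths in a lattice, while the code below is a greedy left-to-right longest-match segmenter. The names `char_class`, `parse_user_dict`, `segment`, and the `unknown_unigrams` flag are hypothetical and exist only for this illustration. The only detail taken from the thread is the rule format, where a line such as `갊수학 갊 수학` maps a surface form to its parts.

```python
def char_class(ch):
    """Coarse character class (assumption: a stand-in for real character definitions)."""
    if "\uac00" <= ch <= "\ud7a3":  # Hangul syllables block
        return "HANGUL"
    if ch.isdigit():
        return "NUMERIC"
    if ch.isalpha():
        return "ALPHA"
    return "OTHER"


def parse_user_dict(lines):
    """Parse lines of the form 'surface [part ...]' into {surface: [parts]}."""
    entries = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        surface, parts = fields[0], fields[1:]
        # A line without explicit parts is a plain word: it decomposes to itself.
        entries[surface] = parts or [surface]
    return entries


def segment(text, dictionary, unknown_unigrams=False):
    """Greedy toy segmenter returning a list of (token, "KNOWN" | "UNKNOWN")."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Look for the longest dictionary entry that starts at the current position.
        match = None
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                match = text[pos:end]
                break
        if match is not None:
            tokens.extend((part, "KNOWN") for part in dictionary[match])
            pos += len(match)
            continue
        # No entry starts here, so emit an unknown token.
        end = pos + 1
        if not unknown_unigrams:
            # Default heuristic: swallow the whole run of same-class characters
            # and resume lookups only after it.
            cls = char_class(text[pos])
            while end < len(text) and char_class(text[end]) == cls:
                end += 1
        # With unknown_unigrams=True the token has length 1, so the next
        # lookup happens at the very next character.
        tokens.append((text[pos:end], "UNKNOWN"))
        pos = end
    return tokens


base = parse_user_dict(["수학"])
print(segment("갊수학", base))
# [('갊수학', 'UNKNOWN')]
print(segment("갊수학", parse_user_dict(["수학", "갊수학 갊 수학"])))
# [('갊', 'KNOWN'), ('수학', 'KNOWN')]
print(segment("갊수학", base, unknown_unigrams=True))
# [('갊', 'UNKNOWN'), ('수학', 'KNOWN')]
```

In this model the length-1 variant performs a dictionary lookup at every position inside an unknown run instead of skipping the run once, which hints at where the extra cost mentioned above would come from. How large that cost is in the real tokenizer is not something this sketch can show.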