[jira] [Commented] (LUCENE-8772) [nori] A word that is registered in advance, but the words are not separated and recognized as 'UNKNOWN'

2019-04-21 Thread YOO JEONGIN (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822831#comment-16822831
 ] 

YOO JEONGIN commented on LUCENE-8772:
-

Hello [~jim.ferenczi],

Thank you for the reply. Even if the cost increases, I think the behavior 
should be changed so that words registered in advance are recognized. I know 
which part needs to be fixed, but I do not know how to fix it. Could you share 
the code change?

> [nori]  A word that is registered in advance, but the words are not separated 
> and recognized as 'UNKNOWN'
> -
>
> Key: LUCENE-8772
> URL: https://issues.apache.org/jira/browse/LUCENE-8772
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 7.5, 7.6, 7.7, 7.7.1, 8.0
> Reporter: YOO JEONGIN
> Priority: Major
> Attachments: image-2019-04-19-11-32-56-310.png
>
>
> Hello,
> With 'nori', if no dictionary word starts at the left edge of the input, the 
> whole input is analyzed as 'UNKNOWN', even when a registered word appears in 
> the middle.
> So here are my questions:
> Does nori analyze only from the left side, and not from the right side?
> Can this be solved?
>
> Example:
> input => 갊수학
> Conditions:
>   registered in the dictionary: 수학
>   not registered in the dictionary: 갊
> result => 갊수학
> !image-2019-04-19-11-32-56-310.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8772) [nori] A word that is registered in advance, but the words are not separated and recognized as 'UNKNOWN'

2019-04-19 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821810#comment-16821810
 ] 

Jim Ferenczi commented on LUCENE-8772:
--

That's expected, since the unknown-word heuristic groups characters of the 
same class together. In this case `갊수학` is treated as a single word, and 
because `갊` is unknown we jump to the end of the unknown word before looking 
for new entries. You can add `갊` to the user dictionary, or add a special rule 
`갊수학 갊 수학` that decomposes the term. We could also change the heuristic to 
add unknown words of length 1, so that user words can be detected inside 
unknown blocks, but I wonder whether the cost of doing that would be 
prohibitive.
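
To make the behavior above concrete, here is a toy Python model of it. This is 
only a sketch under stated assumptions, not nori's code: the real tokenizer 
builds a cost-based lattice, whereas this model does a greedy longest match. 
The names `char_class`, `load_user_dict`, `tokenize` and the `unknown_unigrams` 
flag are hypothetical and exist only for this illustration. Dictionary lines 
are assumed to have the form `surface` or `surface part1 part2 ...`, as in the 
rule `갊수학 갊 수학`.

```python
def char_class(ch):
    """Very coarse character classes, for illustration only."""
    if "\uac00" <= ch <= "\ud7a3":  # precomposed Hangul syllables
        return "HANGUL"
    if ch.isdigit():
        return "NUMERIC"
    if ch.isascii() and ch.isalpha():
        return "ALPHA"
    return "OTHER"


def load_user_dict(lines):
    """Parse lines of the form 'surface' or 'surface part1 part2 ...'."""
    entries = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        surface, parts = fields[0], fields[1:]
        # An entry without an explicit decomposition maps to itself.
        entries[surface] = parts or [surface]
    return entries


def tokenize(text, entries, unknown_unigrams=False):
    """Greedy toy tokenizer (assumption: real nori scores a lattice instead)."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Longest dictionary entry starting at the current position, if any.
        match = next(
            (text[pos:end] for end in range(len(text), pos, -1)
             if text[pos:end] in entries),
            None,
        )
        if match is not None:
            tokens.extend((part, "KNOWN") for part in entries[match])
            pos += len(match)
            continue
        # No entry starts here, so emit an unknown token.
        end = pos + 1
        if not unknown_unigrams:
            # Group the whole run of same-class characters and jump past it,
            # which is why an entry in the middle of the run is never seen.
            cls = char_class(text[pos])
            while end < len(text) and char_class(text[end]) == cls:
                end += 1
        tokens.append((text[pos:end], "UNKNOWN"))
        pos = end
    return tokens


reported = tokenize("갊수학", load_user_dict(["수학"]))
with_entry = tokenize("갊수학", load_user_dict(["수학", "갊"]))
with_rule = tokenize("갊수학", load_user_dict(["수학", "갊수학 갊 수학"]))
with_unigrams = tokenize("갊수학", load_user_dict(["수학"]), unknown_unigrams=True)
```

In this model `reported` is the single unknown token `갊수학`, matching the 
issue. Both workarounds (`with_entry`, `with_rule`) yield `갊` and `수학`, and 
`with_unigrams` yields an unknown `갊` followed by the known `수학`, at the 
price of one dictionary lookup per character inside unknown runs.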



