Yeongsu Kim created LUCENE-8706:
-----------------------------------

             Summary: Nori with DISCARD mode misunderstands compound words, 
when synonym expansion 
                 Key: LUCENE-8706
                 URL: https://issues.apache.org/jira/browse/LUCENE-8706
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Yeongsu Kim


 

I found a bug in the Nori tokenizer.

Let me describe what the problem is, using a concrete example.


 

Let's assume we have the following dictionaries.


 

< userdict_ko.txt >

[ “lg”, “lgtv lg tv”, “tv”, “엘지티비”, “엘지”, “텔레비전”, “티비”, “하이” ]

(“lgtv” is a compound word that decomposes into “lg” and “tv”)


 

< synonyms.txt >

[ “lgtv,엘지티비”, “lg,엘지”, “tv,텔레비전,티비” ]

 

Let’s look at the results for the following queries.

   * Query1 : lgtv

   * Query2 : lg하이tv 

   * Query3 : lg              tv


 

We will run each query with all three decompound modes: “NONE”, “DISCARD”, and
“MIXED”.

Here are the test cases.


 

   * Test 1 (Query 1 + “MIXED”) - the analysis result is [“엘지티비”, “lgtv”, “lg”, 
“tv”]

   * Test 2 (Query 1 + “NONE”) - the analysis result is [“엘지티비”, “lgtv”]

   * Test 3 (Query 1 + “DISCARD”) - the analysis result is [“엘지티비”, “lg”, “tv”]

 

   * Test 4 (Query 2 + “MIXED”) - the analysis result is [“엘지”, “lg”, “하이”, 
“텔레비전”, “티비”, “tv”]

   * Test 5 (Query 2 + “NONE”) - the analysis result is [“엘지”, “lg”, “하이”, 
“텔레비전”, “티비”, “tv”]

   * Test 6 (Query 2 + “DISCARD”) - the analysis result is [“엘지”, “lg”, “하이”, 
“텔레비전”, “티비”, “tv”]
 

 

 

   * Test 7 (Query 3 + “MIXED”) - the analysis result is [“엘지”, “lg”, “텔레비전”, 
“티비”, “tv”]

   * Test 8 (Query 3 + “NONE”) - the analysis result is [“엘지”, “lg”, “텔레비전”, 
“티비”, “tv”]

   * Test 9 (Query 3 + “DISCARD”) - the analysis result is [“엘지티비”, “lg”, “tv”] 
  => (Here is the problem!!!)

 

I don’t understand why Test 9 produces that result. It should be
[“엘지”, “lg”, “텔레비전”, “티비”, “tv”], the same as Tests 7 and 8, because Query 3
has spaces between “lg” and “tv”.

 

The only difference between “DISCARD” and the other modes is that “DISCARD” does
not store the compound token (e.g. “lgtv”) in the pending list. Because the
compound token is missing, the consecutive tokens “lg”, “tv” may be treated as
the compound “lgtv”. However, many different inputs produce the tokens “lg”,
“tv”: for example “lg tv”, “lg * tv”, “lg /// tv”, etc. (spaces and punctuation
are removed during tokenization). The analysis should differentiate “lg tv”
from “lgtv”.
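
To make the suspected mechanism concrete, here is a minimal plain-Java sketch.
It does not use any Lucene classes: `SynonymKeySketch`, `tokenizeDiscard`, and
`key` are hypothetical names, and the tokenizer is a toy. It assumes that a
synonym rule is looked up by token texts only, and that a DISCARD-style
tokenizer emits just the parts of a compound, so the gap between “lg” and “tv”
is no longer visible at lookup time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymKeySketch {
    // Hypothetical stand-in for the userdict entry "lgtv lg tv".
    static final Map<String, List<String>> COMPOUNDS =
        Map.of("lgtv", List.of("lg", "tv"));

    // Toy DISCARD-style tokenizer: drops whitespace and punctuation, and
    // emits only the parts of a known compound, never the compound itself.
    static List<String> tokenizeDiscard(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.split("[^\\p{L}\\p{N}]+")) {
            if (chunk.isEmpty()) {
                continue; // leading delimiters yield an empty first chunk
            }
            tokens.addAll(COMPOUNDS.getOrDefault(chunk, List.of(chunk)));
        }
        return tokens;
    }

    // Assumed lookup key: token texts only, so gaps in the input are lost.
    static String key(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // The rule "lgtv,엘지티비" is assumed to be keyed on the tokens of "lgtv".
        String ruleKey = key(tokenizeDiscard("lgtv"));
        for (String query : List.of("lgtv", "lg              tv", "lg /// tv")) {
            boolean matches = key(tokenizeDiscard(query)).equals(ruleKey);
            System.out.println("[" + query + "] matches the lgtv rule: " + matches);
        }
    }
}
```

Under these assumptions all three inputs reduce to the same key, which would
explain why Query 3 picks up “엘지티비” in Test 9.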

 

I suspect the fix lies in how the Nori tokenizer and the general synonym filter
interact.

Thanks.

 

P.S.

The existing Nori code fails when synonyms are used together with “MIXED” mode.
For these tests, I temporarily deleted `compoundToken.setPositionIncrement(0);`
in `KoreanTokenizer.java`, because `SynonymMap.java` throws an
IllegalArgumentException when a position increment is not 1.
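
The kind of check described here can be sketched in plain Java. This is only an
illustration, not the Lucene source: `PosIncCheckSketch`, `Token`, and
`checkRuleTokens` are hypothetical names, and the MIXED-style token list below
is an assumed example in which one token carries a position increment of 0.

```java
import java.util.List;

public class PosIncCheckSketch {
    // Hypothetical token model: a term plus its position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Rejects any token that does not advance the position by exactly 1.
    static void checkRuleTokens(List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException(
                    "token '" + t.term + "' has position increment "
                        + t.positionIncrement + ", expected 1");
            }
        }
    }

    public static void main(String[] args) {
        // Assumed DISCARD-style output for "lgtv": parts only, one position each.
        checkRuleTokens(List.of(new Token("lg", 1), new Token("tv", 1)));

        // Assumed MIXED-style output: a token stacked with increment 0.
        try {
            checkRuleTokens(List.of(
                new Token("lgtv", 1), new Token("lg", 0), new Token("tv", 1)));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```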

 


