AM created LUCENE-7393:
--------------------------

             Summary: Incorrect ICUTokenization on South East Asian Language
                 Key: LUCENE-7393
                 URL: https://issues.apache.org/jira/browse/LUCENE-7393
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 5.5
         Environment: Ubuntu
            Reporter: AM


Lucene 4.10.3 correctly tokenizes a syllable into one token.  However, in Lucene 
5.5.0 it ends up as two tokens, which is incorrect.  Could you please let me know 
whether the segmentation rules are implemented by native speakers of the particular 
language?  In this particular example, the language is Myanmar.  My understanding is 
that Lao, Khmer and Myanmar fall into the ICU category.  Thanks a lot.

h4. Example 4.10.3

{code:javascript}
GET _analyze?tokenizer=icu_tokenizer&text="နည်"
{
   "tokens": [
      {
         "token": "နည်",
         "start_offset": 1,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}
{code}


h4. Example 5.5.0

{code:javascript}
GET _analyze?tokenizer=icu_tokenizer&text="နည်"
{
  "tokens": [
    {
      "token": "န",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "ည်",
      "start_offset": 1,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
{code}
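
To make the difference easier to see, here is a minimal sketch that breaks the text into code points and applies the 5.5.0 token offsets to it. It assumes the input is exactly the three code points U+1014, U+100A, U+103A (which matches the end offset of 3 above) and uses only the Python standard library, not Lucene or ICU. The names `describe` and `tokens_550` are made up for illustration.

```python
import unicodedata

# Assumption: the reported text is these three code points.
syllable = "\u1014\u100a\u103a"

def describe(text):
    """Return (offset, code point, general category) for each character."""
    return [(i, "U+%04X" % ord(ch), unicodedata.category(ch))
            for i, ch in enumerate(text)]

for row in describe(syllable):
    print(row)

# Token offsets copied from the 5.5.0 response above.
tokens_550 = [(0, 1), (1, 3)]
pieces = [syllable[start:end] for start, end in tokens_550]
print(pieces)
```

The last code point is a combining mark attached to the second consonant, so the 5.5.0 output splits the syllable between its two consonants. The 4.10.3 offsets (1 to 4) look consistent with the leading quotation mark having been counted as part of the text, but that is a guess from the numbers, not something confirmed here.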




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)