[jira] [Commented] (LUCENE-7393) Incorrect ICUTokenization on South East Asian Language

AM (JIRA) Sun, 24 Jul 2016 05:35:42 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391046#comment-15391046
 ]


AM commented on LUCENE-7393:
----------------------------

Thank you Robert.  4.10.x does a good job even for words borrowed from foreign 
language.  For example, it would correctly segment ဘူးလ် as one syllable.  
There are exceptional rules applied for loan words and it seems like rules in 
.rbbi file captures it correctly even for these exceptions.  However again in 
5.5.0 it would end up two syllables ဘူး and လ် which is not correct.  Hand 
coding all the logic in ICU's dictionary-based algorithm seems to be quite 
challenging.  Rules are more compact and does it nicely I think. 

Please let me know where to find dictionary words use in ICU4j?

Many thanks.

> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
>                 Key: LUCENE-7393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7393
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.5
>         Environment: Ubuntu
>            Reporter: AM
>
> Lucene 4.10.3 correctly tokenize a syllable into one token.  However in 
> Lucune 5.5.0 it end up being two tokens which is incorrect.  Please let me 
> know segmentation rules are implemented by native speakers of a particular 
> language? In this particular example, it is M-y-a-n-m-a-r language.  I have 
> understood that L-a-o, K-m-e-r and M-y-a-n-m-a-r fall into ICU category.  
> Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>    "tokens": [
>       {
>          "token": "နည်",
>          "start_offset": 1,
>          "end_offset": 4,
>          "type": "<ALPHANUM>",
>          "position": 1
>       }
>    ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "န",
>       "start_offset": 0,
>       "end_offset": 1,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "ည်",
>       "start_offset": 1,
>       "end_offset": 3,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7393) Incorrect ICUTokenization on South East Asian Language

Reply via email to