[ https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved LUCENE-7393.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 6.2
                   master (7.0)

Thanks for reporting this, AM.

> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
>                 Key: LUCENE-7393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7393
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.5
>         Environment: Ubuntu
>            Reporter: AM
>             Fix For: master (7.0), 6.2
>
>         Attachments: LUCENE-7393.patch
>
>
> Lucene 4.10.3 correctly tokenizes a syllable into one token. However, in
> Lucene 5.5.0 the same syllable ends up as two tokens, which is incorrect.
> Please let me know whether the segmentation rules are implemented by native
> speakers of each language. In this particular example, the language is
> Myanmar. I understand that Lao, Khmer and Myanmar fall into the ICU category.
> Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>    "tokens": [
>       {
>          "token": "နည်",
>          "start_offset": 1,
>          "end_offset": 4,
>          "type": "<ALPHANUM>",
>          "position": 1
>       }
>    ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>    "tokens": [
>       {
>          "token": "န",
>          "start_offset": 0,
>          "end_offset": 1,
>          "type": "<ALPHANUM>",
>          "position": 0
>       },
>       {
>          "token": "ည်",
>          "start_offset": 1,
>          "end_offset": 3,
>          "type": "<ALPHANUM>",
>          "position": 1
>       }
>    ]
> }
> {code}
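
The offsets in the two responses quoted in the report can be checked by hand. The following is only a minimal sketch, not part of the attached patch: it assumes the offsets count UTF-16 code units, which coincide with Python string indices here because every character of the syllable is in the Basic Multilingual Plane. The helper `slice_tokens` is a hypothetical name introduced for illustration. Reading the 4.10.3 offsets (1 to 4) as including the surrounding quote characters is likewise an assumption, not something the report states.

```python
import unicodedata

# The syllable from the report, copied verbatim.
syllable = "နည်"

# Token boundaries reported by 5.5.0, as (start_offset, end_offset) pairs.
reported_5_5 = [(0, 1), (1, 3)]

# Token boundary reported by 4.10.3; assumed here to index into the quoted text.
reported_4_10 = [(1, 4)]


def slice_tokens(text, offsets):
    """Return the substrings selected by (start, end) offset pairs."""
    return [text[start:end] for start, end in offsets]


# List the code points that make up the syllable.
for ch in syllable:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# 5.5.0 splits the syllable after its first code point.
tokens = slice_tokens(syllable, reported_5_5)
print(tokens)

# Under the quoting assumption, 4.10.3 selects the whole syllable.
quoted = '"' + syllable + '"'
print(slice_tokens(quoted, reported_4_10))
```

The two 5.5.0 tokens concatenate back to the original syllable, so the regression is an extra boundary inside it rather than lost or altered text.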