[ 
https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7393:
--------------------------------
    Attachment: LUCENE-7393.patch

Here is a patch restoring the previous rule-based algorithm as an option.

Since we may keep it around and improve it in the future, I added some simple 
tests.

Based on the rules and statistical analysis in the papers below, I think we 
should improve it further to handle more of the special cases (these cases 
account for less than 1%, but we should still try to do better):
* http://www.aclweb.org/anthology/I08-3010
* http://gii2.nagaokaut.ac.jp/gii/media/share/20080901-ZMM%20Presentation.pdf

So as a followup issue, I think it would be good to simply adopt the algorithm 
they developed, to recover that additional 1%. The reason I do not do it here 
is that it may be best to do that part in ICU itself: their algorithm does not 
require huge amounts of context and can be implemented with tables and sets, 
so it might be a good solution for the ICU issue.
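The "tables and sets" idea can be illustrated with a toy sketch (hypothetical class and simplified rule set, not the actual patch or the ICU implementation): assume a syllable break occurs before a Myanmar base consonant unless that consonant is stacked (preceded by virama U+1039) or syllable-final (followed by asat U+103A).

```java
import java.util.ArrayList;
import java.util.List;

// Toy rule-based Myanmar syllable segmenter using only character
// tables/sets, in the spirit of the discussion above. The rules here
// are deliberately simplified assumptions, not the full algorithm.
public class MyanmarSyllables {

    // Myanmar base consonants and independent vowels (U+1000..U+102A);
    // close enough for a sketch.
    static boolean isConsonant(int cp) {
        return cp >= 0x1000 && cp <= 0x102A;
    }

    static final int VIRAMA = 0x1039; // stacks the following consonant
    static final int ASAT   = 0x103A; // marks a syllable-final consonant

    // Break before a consonant unless it is stacked (preceded by virama)
    // or syllable-final (followed by asat).
    static List<String> segment(String text) {
        List<String> syllables = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int start = 0;
        for (int i = 1; i < cps.length; i++) {
            boolean stacked = cps[i - 1] == VIRAMA;
            boolean isFinal = i + 1 < cps.length && cps[i + 1] == ASAT;
            if (isConsonant(cps[i]) && !stacked && !isFinal) {
                syllables.add(new String(cps, start, i - start));
                start = i;
            }
        }
        if (cps.length > 0) {
            syllables.add(new String(cps, start, cps.length - start));
        }
        return syllables;
    }

    public static void main(String[] args) {
        // The example from the report: one syllable, not two.
        System.out.println(segment("နည်"));
        System.out.println(segment("မြန်မာ"));
    }
}
```

Under these rules the reported example "နည်" stays one token (ည is followed by asat, so it is a final, not a new syllable start), and "မြန်မာ" splits into two syllables.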



> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
>                 Key: LUCENE-7393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7393
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.5
>         Environment: Ubuntu
>            Reporter: AM
>         Attachments: LUCENE-7393.patch
>
>
> Lucene 4.10.3 correctly tokenizes a syllable into one token. However, in 
> Lucene 5.5.0 it ends up as two tokens, which is incorrect. Please let me 
> know whether the segmentation rules are implemented by native speakers of a 
> particular language? In this particular example, it is the Myanmar 
> language. I understand that Lao, Khmer, and Myanmar fall into the ICU 
> category. Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>    "tokens": [
>       {
>          "token": "နည်",
>          "start_offset": 1,
>          "end_offset": 4,
>          "type": "<ALPHANUM>",
>          "position": 1
>       }
>    ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "န",
>       "start_offset": 0,
>       "end_offset": 1,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "ည်",
>       "start_offset": 1,
>       "end_offset": 3,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
