[jira] [Updated] (LUCENE-8125) emoji sequence support in ICUTokenizer

Robert Muir (JIRA) Mon, 08 Jan 2018 18:31:12 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-8125:
--------------------------------
    Attachment: LUCENE-8125.patch

Updated the patch with some code comments explaining the logic, in particular
documenting that it's not a perfect science and noting some alternatives that
we could pursue. The current algorithm is very conservative.

In the ICU case, the word break rules use the "extended text segmentation rules
from CLDR", so the breaks themselves also rely on an {{$Extended_Pict}} set,
which is a subset of {{\[:Extended_Pictographic:]-\[:Emoji:]}} but is being
maintained manually, I guess?

Anyway, the logic here could be substantially more aggressive, but I wanted to
start with something simpler and by the book, so to speak.
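
To make "conservative" concrete, here is a minimal sketch of the general idea. It is not the code from the patch. It assumes the word break rules have already produced a segment and only decides which type label to attach. The class name, the {{<EMOJI>}} / {{<OTHER>}} labels, and the single code point range standing in for a real pictographic property lookup are all hypothetical simplifications.

```java
// Illustrative only: a conservative "is this segment an emoji?" check.
// Assumption: segmentation is already done elsewhere; we only pick a type.
final class EmojiTypeSketch {
    static final String EMOJI = "<EMOJI>"; // hypothetical type label
    static final String OTHER = "<OTHER>"; // hypothetical fallback label

    static final int VS16 = 0xFE0F; // emoji presentation selector

    // Crude stand-in for a real Unicode property lookup (assumption):
    // treats only the main supplementary emoji blocks as pictographic.
    static boolean isPictographic(int cp) {
        return cp >= 0x1F300 && cp <= 0x1FAFF;
    }

    static boolean isRegionalIndicator(int cp) {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    static String tokenType(String segment) {
        if (segment.isEmpty()) {
            return OTHER;
        }
        int first = segment.codePointAt(0);
        int rest = Character.charCount(first);
        int next = rest < segment.length() ? segment.codePointAt(rest) : -1;

        // Unambiguous pictographs are tagged right away.
        if (isPictographic(first)) {
            return EMOJI;
        }
        // A flag needs a pair of regional indicators.
        if (isRegionalIndicator(first) && next != -1 && isRegionalIndicator(next)) {
            return EMOJI;
        }
        // Text-default characters (e.g. U+2764) count only when the
        // emoji presentation selector explicitly follows them.
        if (next == VS16) {
            return EMOJI;
        }
        return OTHER;
    }
}
```

Under these assumptions a bare U+2764 stays untyped while U+2764 U+FE0F is tagged as emoji. A more aggressive variant would also accept text-default characters without the selector.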

For more information, see: 
* http://unicode.org/reports/tr29/#WB3c
* https://www.unicode.org/reports/tr51/#Identification
* https://www.unicode.org/repos/cldr/trunk/common/segments/root.xml
* http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/brkitr/rules/word.txt

> emoji sequence support in ICUTokenizer
> --------------------------------------
>
>                 Key: LUCENE-8125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8125
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-8125.patch, LUCENE-8125.patch, LUCENE-8125.patch
>
>
> The UAX29 word break rules already know how to handle these correctly; we
> just need to assign them a token type.
> This is better than users trying to do this with custom rules (e.g.
> LUCENE-7916) because emoji are script-independent (common/inherited).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
