[ https://issues.apache.org/jira/browse/LUCENE-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-8125:
--------------------------------
Attachment: LUCENE-8125.patch
Updated patch that only adds code comments explaining the logic, in particular
documenting that it's not a perfect science and noting some alternatives that
we could try. The current algorithm is very conservative.
In the ICU case, the word break rules use "extended text segmentation rules from
CLDR", so the breaks themselves also use an {{$Extended_Pict}} set, which is a
subset of {{\[:Extended_Pictographic:]-\[:Emoji:]}} but is maintained
manually, I guess?
Anyway, the logic here could be substantially more aggressive, but I wanted to
start with something simpler and by the book, so to speak.
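To make "conservative" concrete, here is a rough, self-contained sketch of the kind of decision involved. It is not taken from the patch: the class name, the type strings, and the hard-coded code point ranges are all assumptions for illustration, and a real implementation would consult the Unicode {{Emoji}} and {{Extended_Pictographic}} properties (e.g. through ICU) rather than fixed ranges. The idea is that the break rules have already produced the token, and we only decide its type from the first code point, refusing to call ambiguous characters such as digits, {{#}}, {{*}} or © emoji unless an emoji presentation selector or keycap follows.

```java
// Hypothetical sketch only; names and ranges are illustrative assumptions,
// not the logic or the API of the attached patch.
final class EmojiTypeSketch {
    static final String EMOJI_TYPE = "<EMOJI>";
    static final String OTHER_TYPE = "<OTHER>";

    private static final int EMOJI_PRESENTATION_SELECTOR = 0xFE0F;
    private static final int COMBINING_ENCLOSING_KEYCAP = 0x20E3;

    // Characters that are usually plain text and only sometimes emoji:
    // keycap bases and a few symbols. Assumed list, deliberately short.
    static boolean isAmbiguousBase(int cp) {
        return cp == '#' || cp == '*' || (cp >= '0' && cp <= '9')
            || cp == 0x00A9 || cp == 0x00AE || cp == 0x2122;
    }

    // Crude stand-in for a real Unicode property lookup (assumption).
    static boolean isRoughlyPictographic(int cp) {
        return (cp >= 0x1F000 && cp <= 0x1FAFF)  // pictograph blocks, flags
            || (cp >= 0x2600 && cp <= 0x27BF);   // misc symbols, dingbats
    }

    // The token is assumed to be one segment already produced by word breaking.
    static String tokenType(String token) {
        if (token.isEmpty()) {
            return OTHER_TYPE;
        }
        int first = token.codePointAt(0);
        int trailer = Character.charCount(first);
        if (isAmbiguousBase(first)) {
            // Conservative: require evidence that an emoji sequence was formed.
            if (trailer < token.length()) {
                int next = token.codePointAt(trailer);
                if (next == EMOJI_PRESENTATION_SELECTOR
                        || next == COMBINING_ENCLOSING_KEYCAP) {
                    return EMOJI_TYPE;
                }
            }
            return OTHER_TYPE;
        }
        return isRoughlyPictographic(first) ? EMOJI_TYPE : OTHER_TYPE;
    }
}
```

With this sketch a bare {{1}} stays a non-emoji token, while {{1}} followed by U+FE0F U+20E3 is typed as emoji. A more aggressive variant could, for example, inspect every code point of the token instead of only the first.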
For more information, see:
* http://unicode.org/reports/tr29/#WB3c
* https://www.unicode.org/reports/tr51/#Identification
* https://www.unicode.org/repos/cldr/trunk/common/segments/root.xml
* http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/brkitr/rules/word.txt
> emoji sequence support in ICUTokenizer
> --------------------------------------
>
> Key: LUCENE-8125
> URL: https://issues.apache.org/jira/browse/LUCENE-8125
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Attachments: LUCENE-8125.patch, LUCENE-8125.patch, LUCENE-8125.patch
>
>
> The UAX#29 word break rules already know how to handle these correctly; we
> just need to assign them a token type.
> This is better than users trying to do this with custom rules (e.g.
> LUCENE-7916) because emoji are script-independent (common/inherited).