[jira] [Commented] (LUCENE-6993) Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all JFlex-based tokenizers to support Unicode 8.0

Mike Drob (JIRA) Thu, 18 Feb 2016 14:35:31 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153258#comment-15153258 ]


Mike Drob commented on LUCENE-6993:
-----------------------------------

Looking at http://unicode.org/reports/tr29/#Modifications I see

{noformat}
Revision 27 [KW, LI]

    Reissued for Unicode 8.0.
    Modified rule SB7 to prevent sentence breaks within a word segment such as 
“Mr.Hamster”.
    Updated notes on tailoring using CLDR boundary suppressions.
    Recast rule tables to use macros for compactness.
    Updated table styles, removed inconsistently applied styles on character 
names and code points, and adjusted layout of various tables and figures.
    Section 3.1 Default Grapheme Cluster Boundary Specification
        Removed the New Tai Lue characters U+19B0..U+19B4, U+19B8..U+19B9, 
U+19BB..U+19C0, U+19C8..U+19C9 from the exception list for SpacingMark in Table 
2, Grapheme_Cluster_Break Property Values.
        Added U+11720 AHOM VOWEL SIGN A and U+11721 AHOM VOWEL SIGN AA to the 
same exception list for SpacingMark.

Revision 26 being a proposed update, only changes between versions 27 and 25 
are noted here.

Revision 25

    Reissued for Unicode 7.0.
    General text cleanup, including “_” in property and property value names, 
use of curly-quotes and italics.
    Section 3.1 Default Grapheme Cluster Boundary Specification
        Added U+AA7D MYANMAR SIGN TAI LAING TONE-5 to the exception list for 
SpacingMark in Table 2, Grapheme_Cluster_Break Property Values.
    Section 5.1 Default Sentence Boundary Specification
        Added note to clarify that Format and Extend characters are not joined 
to separators like LF.
        Added note about the fact that words can span a sentence break.
{noformat}

I am by no means a Unicode expert, but it looks like the Sentence Break rule 
change (SB7) is not relevant to us, right? Are the SpacingMark / Grapheme 
Cluster Break changes relevant, though?
When you refer to the word break test data, do you mean something that the 
Unicode Consortium publishes, or our own internal data?
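
For reference, and assuming the data in question is the Consortium's 
WordBreakTest.txt from the Unicode Character Database (that is an assumption; 
the question above is still open): each line of that file lists hex code 
points separated by ÷ (boundary) or × (no boundary), followed by a # comment. 
A minimal sketch of turning one such line into the expected segments is below. 
The class and method names are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: parses one line of break-test data. */
public class BreakTestLine {
  private static final String BREAK = "\u00F7";     // the division sign marks a boundary
  private static final String NO_BREAK = "\u00D7";  // the multiplication sign marks no boundary

  /** Returns the segments a conforming segmenter is expected to produce for this line. */
  public static List<String> expectedSegments(String line) {
    // Everything after '#' is a human-readable comment.
    int hash = line.indexOf('#');
    String data = (hash >= 0 ? line.substring(0, hash) : line).trim();

    List<String> segments = new ArrayList<>();
    if (data.isEmpty()) {
      return segments; // blank or comment-only line
    }

    StringBuilder current = new StringBuilder();
    for (String token : data.split("\\s+")) {
      if (token.equals(BREAK)) {
        // Boundary: close the segment collected so far, if any.
        if (current.length() > 0) {
          segments.add(current.toString());
          current.setLength(0);
        }
      } else if (!token.equals(NO_BREAK)) {
        // Anything else is a hex code point belonging to the current segment.
        current.appendCodePoint(Integer.parseInt(token, 16));
      }
    }
    if (current.length() > 0) {
      segments.add(current.toString());
    }
    return segments;
  }

  public static void main(String[] args) {
    // Sample line in the same shape as the published data: "a" + U+0308, then "b".
    String line = "\u00F7 0061 \u00D7 0308 \u00F7 0062 \u00F7\t# sample";
    System.out.println(expectedSegments(line).size()); // prints 2
  }
}
```

The expected segments could then be compared against the tokens a tokenizer 
emits for the concatenated input, which is one way to check a grammar against 
a new Unicode version.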

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all 
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the 
> list of TLDs again. Comparing our old list with a new list indicates 800+ new 
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
