[ 
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176520#comment-15176520
 ] 

Robert Muir commented on LUCENE-6993:
-------------------------------------

I wouldn't change any of the ClassicTokenizer ranges; it should just continue to 
do what it did before.

Not all of the Thai characters are letters, and it's important not to, e.g., split 
on tone marks or make other mistakes like that: 
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3AThai%3A]-[%3ALetter%3A]]%0D%0A&g=&i=
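
A rough way to see the same set locally, as a sketch only: this uses the JDK's 
java.lang.Character tables (whose Unicode version depends on the JDK and may not 
be 8.0), not anything from the ClassicTokenizer grammar. The class and method 
names are made up for illustration.

```java
public class ThaiNonLetters {
    /** True if the code point is in the Thai script but is not a Unicode letter. */
    static boolean isThaiNonLetter(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.THAI
                && !Character.isLetter(cp);
    }

    public static void main(String[] args) {
        // Walk the Thai block (U+0E00..U+0E7F) and list everything that is
        // Thai-script but not a letter: tone marks, vowel signs, digits, etc.
        for (int cp = 0x0E00; cp <= 0x0E7F; cp++) {
            if (isThaiNonLetter(cp)) {
                System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
            }
        }
    }
}
```

For example, U+0E48 (THAI CHARACTER MAI EK, a tone mark) is a nonspacing mark, so 
a range that only covers letters would split a word at that position.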

CJ is a separate category because ClassicTokenizer will return each Han 
character individually as a token. On the other hand, Hangul (K) is kept with 
letter because it is an alphabet.
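
To illustrate the difference in behavior, here is a toy sketch, not the JFlex 
grammar: it assumes "CJ" means only Han-script code points (the real CJ range 
may cover more) and treats every other letter, Hangul included, as part of a 
run. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CjSplitSketch {
    // Assumption for this sketch: "CJ" is approximated by the Han script only.
    static boolean isHan(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
    }

    /** Emits each Han character as its own token; other letters form runs. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isHan(cp)) {
                flush(run, tokens);                       // end any pending letter run
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetter(cp)) {
                run.appendCodePoint(cp);                  // Hangul, Latin, etc. stay together
            } else {
                flush(run, tokens);                       // anything else is a separator
            }
        }
        flush(run, tokens);
        return tokens;
    }

    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }
}
```

With this sketch, two Han characters come back as two tokens, while a Hangul word 
comes back as a single token, just like a word in any other alphabet.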

> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all 
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6993
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mike Drob
>            Assignee: Robert Muir
>             Fix For: 6.0
>
>         Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch, 
> LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the 
> list of TLDs again. Comparing our old list with a new list indicates 800+ new 
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.



