[ 
https://issues.apache.org/jira/browse/LUCENE-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126059#comment-14126059
 ] 

Steve Rowe commented on LUCENE-5927:
------------------------------------

These characters are in the Tai Tham block. All characters in that block have 
the property {{LB:ComplexContext}}, and StandardTokenizer returns sequences of 
such characters as token type {{<SOUTHEAST_ASIAN>}}. 
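As a quick sanity check on the block membership, here is a minimal sketch using only the JDK. The class and method names ({{TaiThamBlockCheck}}, {{allInTaiTham}}) are hypothetical, made up for this illustration. It only checks the Unicode block; as far as I know {{java.lang.Character}} does not expose the Line_Break property, so the {{LB}} value itself is not verified here.

```java
final class TaiThamBlockCheck {
    // The string from the issue description.
    static final String REPORTED = "\u1aa2\u1a7f\u1a6f\u1a6f\u1a61\u1a72";

    // Returns true if every code point of s lies in the Tai Tham block.
    // This checks block membership only, not the Line_Break property.
    static boolean allInTaiTham(String s) {
        return s.codePoints()
                .allMatch(cp -> Character.UnicodeBlock.of(cp)
                        == Character.UnicodeBlock.TAI_THAM);
    }

    public static void main(String[] args) {
        System.out.println(allInTaiTham(REPORTED));
    }
}
```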

The behavior change comes from a grammar fix I included with LUCENE-5770. 
Before 4.10, the {{<SOUTHEAST_ASIAN>}} rule did not allow trailing 
{{WB:Format}} or {{WB:Extend}} chars. Here are the relevant parts of the 4.9 
grammar:

{noformat}
ComplexContextSupp = ([])  // no supplementary characters in {{LB:ComplexContext}} in Unicode 6.3
...
ComplexContext    = (\p{LB:Complex_Context} | {ComplexContextSupp})
...
{ComplexContext}+ { return SOUTH_EAST_ASIAN_TYPE; }
{noformat}

and the 4.10 grammar is now (note the addition of {{WB:Format}} and 
{{WB:Extend}} chars):

{noformat}
ComplexContextEx = \p{LB:Complex_Context} [\p{WB:Format}\p{WB:Extend}]*
...
{ComplexContextEx}+ { return SOUTH_EAST_ASIAN_TYPE; }
{noformat}
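To make the effect of the two rules concrete, here is a rough sketch that mimics them with {{java.util.regex}} over a toy alphabet. It is not the JFlex-generated scanner and uses no real Unicode property data: {{C}} stands in for a character with {{LB:Complex_Context}}, {{x}} stands in for a {{WB:Format}} or {{WB:Extend}} character, and all names in it are hypothetical. Which character of the reported string plays the role of {{x}} is an assumption, not something established above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComplexContextRuleSketch {
    // Toy alphabet (an assumption for illustration, not real Unicode data):
    //   'C' = a character with LB:Complex_Context
    //   'x' = a character with WB:Format or WB:Extend

    // 4.9-style rule:  {ComplexContext}+
    static final Pattern RULE_4_9 = Pattern.compile("C+");

    // 4.10-style rule: ({ComplexContext} [Format Extend]*)+
    static final Pattern RULE_4_10 = Pattern.compile("(?:Cx*)+");

    // Collects each greedy match of the rule in order.
    // Characters that no match covers are simply skipped in this sketch.
    static List<String> tokenize(Pattern rule, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = rule.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "CxCCCC";
        System.out.println("4.9-style:  " + tokenize(RULE_4_9, input));
        System.out.println("4.10-style: " + tokenize(RULE_4_10, input));
    }
}
```

With the input {{CxCCCC}}, the 4.9-style rule stops at the {{x}} and yields two tokens, while the 4.10-style rule absorbs the {{x}} and yields one. The real 4.9 grammar has other rules that decide what happens to the character in between, so this sketch only shows why the run is split or joined, not the exact token text.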


> 4.9 -> 4.10 change in StandardTokenizer behavior on \u1aa2
> ----------------------------------------------------------
>
>                 Key: LUCENE-5927
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5927
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ryan Ernst
>
> In 4.9, this string was broken into 2 tokens by StandardTokenizer:
> "\u1aa2\u1a7f\u1a6f\u1a6f\u1a61\u1a72" = "\u1aa2", "\u1a7f\u1a6f\u1a6f\u1a61\u1a72"
> However, in 4.10 it is returned as a single token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
