[ https://issues.apache.org/jira/browse/LUCENE-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126035#comment-14126035 ]
Robert Muir commented on LUCENE-5927:
-------------------------------------
From Ryan's explanation to me, this only affects characters whose line-break
class is Complex Context (4.9 wrongly splits on a combining mark). I think we
can be sensible about what we do here (I suggest: nothing), because in such a
case you aren't getting "useful" tokens from the tokenizer anyway unless you
are doing downstream processing... and if you are doing that, it's very good
that this bug is fixed.
> 4.9 -> 4.10 change in StandardTokenizer behavior on \u1aa2
> ----------------------------------------------------------
>
> Key: LUCENE-5927
> URL: https://issues.apache.org/jira/browse/LUCENE-5927
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Ryan Ernst
>
> In 4.9, this string was broken into 2 tokens by StandardTokenizer:
> "\u1aa2\u1a7f\u1a6f\u1a6f\u1a61\u1a72" = "\u1aa2", "\u1a7f\u1a6f\u1a6f\u1a61\u1a72"
> However, in 4.10, that has changed, and a single token is now returned.
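
To see which code points in the reported string are combining marks, here is
a minimal sketch that uses only the JDK and not Lucene. It does not reproduce
the tokenizer's behavior; it only lists each code point and whether the JDK's
Unicode tables classify it as a mark (Mn, Mc, or Me). The class name
CombiningMarks and its two methods are hypothetical names made up for this
illustration, and the classification depends on the Unicode version shipped
with the JDK in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
public class CombiningMarks {

    /** True if the JDK classifies the code point as a mark (Mn, Mc, or Me). */
    public static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Returns one description per code point, in the form "U+XXXX mark=true|false". */
    public static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(String.format("U+%04X mark=%b", cp, isCombiningMark(cp)));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out;
    }

    public static void main(String[] args) {
        // The string from the issue description.
        String reported = "\u1aa2\u1a7f\u1a6f\u1a6f\u1a61\u1a72";
        for (String line : describe(reported)) {
            System.out.println(line);
        }
    }
}
```

The second code point, U+1A7F, is the position where 4.9 started a new token
according to the issue description, so its classification is the one worth
checking first.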
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)