[ https://issues.apache.org/jira/browse/LUCENE-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126081#comment-14126081 ]
Steve Rowe commented on LUCENE-5927:
------------------------------------
bq. I think we can be sensible about what we do here (I suggest: nothing),
because in such a case you aren't getting "useful" tokens from the tokenizer
anyway unless you are doing downstream processing... and if you are doing that,
it's very good that this bug is fixed.
Version-specific behavior is important for people who don't want changes; IMHO
everybody impacted by this change would want it, so I agree: we should do
nothing.
> 4.9 -> 4.10 change in StandardTokenizer behavior on \u1aa2
> ----------------------------------------------------------
>
> Key: LUCENE-5927
> URL: https://issues.apache.org/jira/browse/LUCENE-5927
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Ryan Ernst
>
> In 4.9, this string was broken into 2 tokens by StandardTokenizer:
> "\u1aa2\u1a7f\u1a6f\u1a6f\u1a61\u1a72" = "\u1aa2", "\u1a7f\u1a6f\u1a6f\u1a61\u1a72"
> However, in 4.10 this has changed: the whole string is now returned as a single token.