[jira] [Commented] (LUCENE-5927) 4.9 -> 4.10 change in StandardTokenizer behavior on \u1aa2

Steve Rowe (JIRA) Mon, 08 Sep 2014 13:57:42 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126081#comment-14126081 ]


Steve Rowe commented on LUCENE-5927:
------------------------------------

bq. I think we can be sensible about what we do here (I suggest: nothing), 
because in such a case you aren't getting "useful" tokens from the tokenizer 
anyway unless you are doing downstream processing... and if you are doing that, 
it's very good that this bug is fixed.

Version-specific behavior is important for people who don't want changes; IMHO 
everybody impacted by this change would want it, so I agree: we should do 
nothing.

> 4.9 -> 4.10 change in StandardTokenizer behavior on \u1aa2
> ----------------------------------------------------------
>
>                 Key: LUCENE-5927
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5927
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ryan Ernst
>
> In 4.9, this string was broken into 2 tokens by StandardTokenizer:
> "\u1aa2\u1a7f\u1a6f\u1a6f\u1a61\u1a72" = "\u1aa2", "\u1a7f\u1a6f\u1a6f\u1a61\u1a72"
> However, in 4.10 that has changed, and a single token is now returned.
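
For anyone who wants to look at the characters involved, here is a minimal sketch that uses only the JDK and makes no Lucene calls. The class and method names are made up for illustration. It does not reproduce the tokenizer change itself; it only lists each code point of the reported string with the name and mark status that the running JVM reports, and checks that the two tokens reported for 4.9 concatenate to the single token reported for 4.10.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Lucene.
public class TaiThamCodePoints {
    // The string from the issue description.
    static final String INPUT = "\u1aa2\u1a7f\u1a6f\u1a6f\u1a61\u1a72";

    // The two tokens reported for 4.9; in 4.10 the whole INPUT is one token.
    static final String[] TOKENS_4_9 = { "\u1aa2", "\u1a7f\u1a6f\u1a6f\u1a61\u1a72" };

    // True if the JVM's Unicode data classifies cp as a mark (Mn, Mc or Me).
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // One line per code point: hex value, Unicode name, and a mark flag.
    static String describe(int cp) {
        return String.format("U+%04X %s%s", cp, Character.getName(cp),
                isCombiningMark(cp) ? " (combining mark)" : "");
    }

    // Walk the string by code point rather than by char.
    static List<String> describeAll(String s) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            lines.add(describe(cp));
            i += Character.charCount(cp);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeAll(INPUT)) {
            System.out.println(line);
        }
        // The 4.9 tokens together cover exactly the 4.10 token.
        System.out.println(String.join("", TOKENS_4_9).equals(INPUT));
    }
}
```

The names and categories printed depend on the Unicode version bundled with the JVM, so the output may differ slightly between Java releases.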



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

