Personally, I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate is still a "legal" independent entity inside a String object. The length of a String is still defined as the total number of char units, and an index that falls between a high surrogate and a low surrogate is still a legal index that can be used to access the char at that position.

Using an empty String "" as the regex for replaceAll() takes advantage of the special meaning of "": it matches at every zero-width position in the target String. It does not imply anything about the "character" or "characters" around that position, so I would not interpret it as a zero-width character boundary. Therefore a "position" between the two surrogates of a pair is still a valid match for replacing.
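
To make the char-unit view concrete, here is a minimal sketch using only standard java.lang calls. The class name CharUnitDemo and its two helper methods are made up for illustration; the sample string is the one from the original report.

```java
class CharUnitDemo {
    // Number of UTF-16 char units, which is what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // A, B, U+20000 encoded as a surrogate pair, C
        String s = "AB\uD840\uDC00C";

        System.out.println(charUnits(s));   // 5
        System.out.println(codePoints(s));  // 4

        // Index 3 falls between the high and low surrogate and is a legal index.
        char c = s.charAt(3);
        System.out.println(Character.isLowSurrogate(c)); // true
    }
}
```

The String has five char units but four code points, and charAt(3) returns the low surrogate on its own without complaint.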

-Sherman

On 6/7/2012 1:07 PM, Dawid Weiss wrote:
Hi, I'm a committer to the Apache Lucene project. We have randomized
tests and one seed hit the following (simplified) scenario:

    String s1 = "AB\uD840\uDC00C";
    String s2 = s1.replaceAll("", "X");

The input contains a supplementary Unicode character (any surrogate
pair will do). The pattern is an empty string (in fact, it was
randomized as "]|", but it's the same problem, so I omit the details).
The problem is that after applying this pattern, replaceAll inserts X
between the two surrogates of the pair, which results in invalid
UTF-16:

AB𠀀C
XAXBX?X?XCX
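
One way to check a result for this kind of damage is to scan it for surrogates that are not part of a well-formed pair. The following is only a sketch: the class LoneSurrogateCheck and the method hasUnpairedSurrogate are hypothetical names, not part of the JDK or Lucene, and what the second println shows depends on the regex implementation in use.

```java
class LoneSurrogateCheck {
    // Returns true if s contains a high surrogate that is not immediately
    // followed by a low surrogate, or a low surrogate with no high
    // surrogate directly before it.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String s2 = s1.replaceAll("", "X");

        // The input keeps its pair intact, so this prints false.
        System.out.println(hasUnpairedSurrogate(s1));
        // Reports whether the replacement split the pair.
        System.out.println(hasUnpairedSurrogate(s2));
    }
}
```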

I believe this is a bug in the regexp implementation (sorry, don't
have a patch for it) but I'd like to confirm it's not something known.
Pointers appreciated.

Dawid
