Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Xueming Shen Thu, 07 Jun 2012 15:44:02 -0700

Personally I don't think it is a bug. A j.l.String represents a sequence
of UTF-16 chars. While a pair of surrogates represents a supplementary
character, a single surrogate is still a "legal" independent entity
inside a String object: the length of a String is still defined as the
total number of char units, and an index between a high surrogate and a
low surrogate is still a legal index that can be used to access the char
at that position. Using an empty String "" as the regex for replaceAll()
takes advantage of the special meaning of "": it is interpreted as
matching any possible zero-width position of the target String. It does
not imply anything about the "character" or "characters" around it, so I
would not interpret it as a zero-width character boundary. Therefore a
"position" between a pair of surrogates is still a good "found" for
replacing.
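A minimal sketch of the point above, using the sample string from the
original report. The class and helper names (SurrogateIndexDemo,
charUnits, codePoints) are made up for illustration; only standard
java.lang calls are used. The replaceAll() result is printed rather
than assumed, since it depends on the regex implementation in use.

```java
public class SurrogateIndexDemo {
    // Sample from the original report: 'A', 'B', U+20000 (one surrogate pair), 'C'.
    static final String SAMPLE = "AB\uD840\uDC00C";

    // Number of UTF-16 char units, which is what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("char units:  " + charUnits(SAMPLE));
        System.out.println("code points: " + codePoints(SAMPLE));

        // Index 3 lies between the two halves of the pair and is still
        // a legal index for charAt().
        System.out.println("high at 2: " + Character.isHighSurrogate(SAMPLE.charAt(2)));
        System.out.println("low at 3:  " + Character.isLowSurrogate(SAMPLE.charAt(3)));

        // The empty regex matches at zero-width positions between char units.
        String replaced = SAMPLE.replaceAll("", "X");
        System.out.println("length after replaceAll: " + replaced.length());
    }
}
```

The gap between length() and codePointCount() is the distinction being
drawn here: String indices address char units, not characters.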


-Sherman


On 6/7/2012 1:07 PM, Dawid Weiss wrote:

Hi, I'm a committer to the Apache Lucene project. We have randomized
tests and one seed hit the following (simplified) scenario:

    String s1 = "AB\uD840\uDC00C";
    String s2 = s1.replaceAll("", "X");

The input contains a supplementary Unicode character (any surrogate pair
will do). The pattern is an empty string (in fact, it was randomized
as "]|" but it's the same problem so I omit the details). The problem
is that after applying this pattern, replaceAll inserts X in between
the surrogate pair characters and this results in invalid UTF-16:

AB𠀀C
XAXBX?X?XCX
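For comparison, a rough sketch of what inserting at code point
boundaries (rather than at every char index) could look like, plus a
check for unpaired surrogates. This is not part of java.util.regex; the
class and method names (CodePointInsert, insertAtCodePointBoundaries,
hasUnpairedSurrogate) are hypothetical and only standard
String/Character/StringBuilder calls are used.

```java
public class CodePointInsert {
    // Inserts sep before the first code point, between code points,
    // and after the last one, never splitting a surrogate pair.
    static String insertAtCodePointBoundaries(String s, String sep) {
        StringBuilder sb = new StringBuilder(sep);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(cp).append(sep);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return sb.toString();
    }

    // Returns true if s contains a high surrogate not followed by a low
    // surrogate, or a low surrogate not preceded by a high surrogate.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i += 2; // well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            } else {
                i++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String viaRegex = s1.replaceAll("", "X");
        String viaCodePoints = insertAtCodePointBoundaries(s1, "X");
        System.out.println("regex result unpaired:      " + hasUnpairedSurrogate(viaRegex));
        System.out.println("code point result unpaired: " + hasUnpairedSurrogate(viaCodePoints));
    }
}
```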

I believe this is a bug in the regexp implementation (sorry, don't
have a patch for it) but I'd like to confirm it's not something known.
Pointers appreciated.

Dawid
