Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Dawid Weiss Thu, 07 Jun 2012 13:08:08 -0700

Hi, I'm a committer to the Apache Lucene project. We have randomized
tests and one seed hit the following (simplified) scenario:


   String s1 = "AB\uD840\uDC00C";
   String s2 = s1.replaceAll("", "X");

The input contains a supplementary Unicode character (any surrogate
pair will do). The pattern is an empty string (it was actually
randomized as "]|", but that hits the same problem, so I omit the
details). After applying this pattern, replaceAll inserts X between
the two halves of the surrogate pair, which results in invalid UTF-16:

AB𠀀C
XAXBX?X?XCX
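
For anyone who wants to check this on their own runtime, here is a
minimal sketch (not part of the original test case). The class name
SurrogateCheck and both helper methods are hypothetical names made up
for illustration. The first helper detects unpaired surrogates; the
second shows what I would assume the "expected" result to be, with X
inserted only at code point boundaries. Whether the replaceAll line
reports corruption may depend on the JDK version in use.

```java
final class SurrogateCheck {

    // Hypothetical helper: returns true if s contains a high surrogate
    // that is not followed by a low surrogate, or a low surrogate that
    // is not preceded by a high surrogate.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair, skip its second half
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical workaround: inserts x before every code point and
    // once at the end, so a surrogate pair is never split.
    static String insertAtCodePointBoundaries(String s, String x) {
        StringBuilder sb = new StringBuilder(x);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(cp).append(x);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";

        // The call from the report; prints true on a runtime that
        // shows the behavior described above.
        String s2 = s1.replaceAll("", "X");
        System.out.println(hasUnpairedSurrogate(s2));

        // Code-point-aware insertion keeps the pair intact.
        String s3 = insertAtCodePointBoundaries(s1, "X");
        System.out.println(hasUnpairedSurrogate(s3)); // false
    }
}
```

The workaround walks the string with codePointAt and charCount rather
than by char index, which is why the pair stays together.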

I believe this is a bug in the regexp implementation (sorry, I don't
have a patch for it), but I'd like to confirm it isn't a known issue
first. Pointers appreciated.

Dawid
