Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Xueming Shen Fri, 08 Jun 2012 15:24:50 -0700

On 06/08/2012 12:07 PM, Ulf Zibis wrote:

Thanks Sherman!
On 08.06.2012 20:36, Xueming Shen wrote:
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or Unicode code points?
The regex spec says Pattern and Matcher work ON a character sequence, with reference to the CharSequence interface, but the pattern itself does support Unicode characters via various regex constructs and flags.
In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a sequence of Unicode code points, right?


Not exactly what I meant.
The engine currently works as follows:

If the pattern is to match a "character" or a "slice of characters" that has a supplementary character embedded, the engine will try to interpret the target char sequence as a sequence of Unicode code points.

If the pattern is not to match a "character", or is to match a slice of characters that does not have a supplementary character embedded, the engine will try to interpret the char sequence as a sequence of char units.

For example

Matcher m = Pattern.compile("[^a]").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");

while (m.find()) {
    System.out.printf("<%d, %d>%n", m.start(), m.end());
}

The output is

<0, 2>
<2, 4>
<4, 6>

The target string is iterated code point by code point, but

Matcher m = Pattern.compile("(?=[^a])").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");

while (m.find()) {
    System.out.printf("<%d, %d>%n", m.start(), m.end());
}

The output is

<0, 0>
<1, 1>
<2, 2>
<3, 3>
<4, 4>
<5, 5>

And the empty string pattern belongs to the latter case.
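To see where the empty-pattern matches land relative to surrogate pairs, a small check along the following lines may help. This is only a sketch: the class name SurrogateBoundaryCheck and the helper splitsSurrogatePair are made up for illustration and are not part of java.util.regex. The output is deliberately not listed here, since it depends on how the engine in a given JDK advances over empty matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SurrogateBoundaryCheck {

    // Hypothetical helper: true if `index` lies between the high and the low
    // surrogate of a well-formed pair in `seq`.
    static boolean splitsSurrogatePair(CharSequence seq, int index) {
        return index > 0
                && index < seq.length()
                && Character.isHighSurrogate(seq.charAt(index - 1))
                && Character.isLowSurrogate(seq.charAt(index));
    }

    public static void main(String[] args) {
        String target = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";
        Matcher m = Pattern.compile("").matcher(target);
        while (m.find()) {
            // Report each empty match and whether it falls inside a pair.
            System.out.printf("<%d, %d> splits pair: %b%n",
                    m.start(), m.end(), splitsSurrogatePair(target, m.start()));
        }
    }
}
```

Any match reported with "splits pair: true" is a position where a replacement string would be inserted between the two halves of a pair, which is the corruption described in the subject line.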

No, I'm not saying that because the implementation works this way, this is therefore not a bug :-) Actually, I'm starting to agree that we might not want to stop in the middle of a pair of surrogates, even in the non-character case. But it might have some performance impact somewhere (if you iterate the CharSequence by code point).
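For reference, iterating a CharSequence by code point rather than by char could look roughly like the sketch below. The class and method names (CodePointWalk, codePointStarts) are hypothetical and this is not the engine's code; it only illustrates the extra surrogate decoding that each step needs, which is presumably where a performance cost would come from.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {

    // Hypothetical helper: returns the char index at which each code point
    // of `seq` starts, so no returned index falls inside a surrogate pair.
    static List<Integer> codePointStarts(CharSequence seq) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < seq.length(); ) {
            starts.add(i);
            int cp = Character.codePointAt(seq, i); // decodes a pair if present
            i += Character.charCount(cp);           // 1 for BMP, 2 for supplementary
        }
        return starts;
    }

    public static void main(String[] args) {
        String target = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";
        System.out.println(codePointStarts(target)); // prints [0, 2, 4]
    }
}
```

A plain char loop would instead visit indices 0 through 5 for the same string, without any decoding per step.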

-Sherman

"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
==> "\uD840\uDC00?\uD840\uDC02" // only 1 replacement for \uD840\uDC01
"12\uD840\uDC02".replaceAll("[^0-9]", "?")
==> "12??"          // 2 replacements for \uD840\uDC02
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
==> "\uD840\uDC00\uD840\uDC01?" // only 1 replacement for \uD840\uDC02
An empty String pattern is really a corner case here; it does not say anything about a "character".
So it should be specified in the javadoc, and I'm with Dawid on implementing it as in Python.
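As a rough illustration of that behaviour, assuming it means that an empty pattern matches only at code point boundaries and never inside a surrogate pair, a stand-alone helper could look like this. The names EmptyPatternReplace and replaceEmptyAtCodePoints are hypothetical, the replacement is treated as literal text (no $ group references), and this is not how java.util.regex is implemented.

```java
final class EmptyPatternReplace {

    // Hypothetical helper, not part of java.util.regex: inserts `replacement`
    // at every code point boundary of `input`, never inside a surrogate pair.
    static String replaceEmptyAtCodePoints(String input, String replacement) {
        StringBuilder out = new StringBuilder();
        out.append(replacement);             // empty match at index 0
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            out.appendCodePoint(cp);         // copy the whole code point
            out.append(replacement);         // empty match right after it
            i += Character.charCount(cp);    // skip both halves of a pair
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String result = replaceEmptyAtCodePoints("\uD840\uDC00", "-");
        // The pair stays intact: one replacement before it and one after it.
        System.out.println(result.equals("-\uD840\uDC00-")); // prints true
    }
}
```

With this sketch, replaceEmptyAtCodePoints("ab", "-") returns "-a-b-", and a supplementary character is treated the same way as a single BMP character.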
-Ulf
