On 06/08/2012 12:07 PM, Ulf Zibis wrote:
Thanks Sherman!
On 08.06.2012 20:36, Xueming Shen wrote:
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec on whether the Java Regex API has a general contract
with 16-bit chars or with Unicode code points?
The regex spec says Pattern and Matcher work on a character sequence,
with reference to the CharSequence interface, but the pattern itself
does support Unicode characters via various regex constructs and flags.
In other words, if there is a surrogate pair in the pattern, the
CharSequence is seen as a sequence of Unicode code points, right?
Not exactly what I meant.
The engine currently works as follows: if the pattern is to match a
"character", or a "slice of characters" that has a supplementary
character embedded, the engine will try to interpret the target char
sequence as a sequence of Unicode code points.
If the pattern is not to match a "character", or is to match a slice of
characters that does not have a supplementary character embedded, the
engine will try to interpret the char sequence as a sequence of char
units.
For example:
Matcher m = Pattern.compile("[^a]").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
while (m.find()) {
    System.out.printf("<%d, %d>%n", m.start(), m.end());
}
The output is
<0, 2>
<2, 4>
<4, 6>
Here the target string is iterated code point by code point, but with
Matcher m = Pattern.compile("(?=[^a])").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
while (m.find()) {
    System.out.printf("<%d, %d>%n", m.start(), m.end());
}
The output is
<0, 0>
<1, 1>
<2, 2>
<3, 3>
<4, 4>
<5, 5>
And the empty string pattern belongs to the latter case.
No, I'm not saying that because the implementation works this way,
this is therefore not a bug :-)
Actually I'm starting to agree that we might not want to stop in the
middle of a pair of surrogates, even in the non-character case. But it
might have some performance impact somewhere (if you iterate the
CharSequence by code point).
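To make "not stopping in the middle of a pair" concrete, here is a minimal sketch that post-filters match positions on the caller's side. It is not how the engine does or would implement this; the class and method names (SurrogateSafeMatches, splitsSurrogatePair, matchStartsOnBoundaries) are made up for illustration, and only standard java.util.regex and Character calls are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of java.util.regex.
public class SurrogateSafeMatches {

    /** True if index i lies between a high surrogate and its low surrogate. */
    static boolean splitsSurrogatePair(CharSequence cs, int i) {
        return i > 0 && i < cs.length()
                && Character.isHighSurrogate(cs.charAt(i - 1))
                && Character.isLowSurrogate(cs.charAt(i));
    }

    /** Start offsets of all matches, skipping those that begin inside a pair. */
    static List<Integer> matchStartsOnBoundaries(String regex, CharSequence input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // Drop any match whose start would split a surrogate pair.
            if (!splitsSurrogatePair(input, m.start())) {
                starts.add(m.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String s = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";
        System.out.println(matchStartsOnBoundaries("(?=[^a])", s));
    }
}
```

For the lookahead example above, this drops the matches at offsets 1, 3 and 5. The per-match check costs two charAt calls, which hints at the kind of overhead the engine would pay if it did the same internally.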
-Sherman
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
==> "\uD840\uDC00?\uD840\uDC02" // only 1 replacement for \uD840\uDC01
"12\uD840\uDC02".replaceAll("[^0-9]", "?")
==> "12??" // 2 replacements for \uD840\uDC02
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
==> "\uD840\uDC00\uD840\uDC01?" // only 1 replacement for \uD840\uDC02
An empty String pattern is really a corner case here; it does not say
anything about "character".
So it should be specified in the javadoc, and I'm with Dawid on
implementing it as in Python.
-Ulf