Thanks Sherman!
On 08.06.2012 20:36, Xueming Shen wrote:
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec on whether the Java Regex API has a general contract based on 16-bit chars
or on Unicode code points?
The regex spec says Pattern and Matcher work on a character sequence, with
reference to the CharSequence interface, but the pattern itself does support
Unicode characters via various regex constructs and flags.
In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a
sequence of Unicode code points, right?
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
==> "\uD840\uDC00?\uD840\uDC02" // only 1 replacement for \uD840\uDC01
"12\uD840\uDC02".replaceAll("[^0-9]", "?")
==> "12??" // 2 replacements for \uD840\uDC02
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
==> "\uD840\uDC00\uD840\uDC01?" // only 1 replacement for \uD840\uDC02
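To make the two views concrete, here is a minimal sketch (not part of the original exchange). The class and method names (SurrogateDemo, codeUnits, codePoints, countMatches) are made up for illustration. It counts 16-bit chars versus code points for a supplementary character and prints how many matches each of the patterns above produces, without assuming what those counts should be.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class SurrogateDemo {

    // Number of 16-bit char values (UTF-16 code units) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s (a valid surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of matches that regex finds in input; replaceAll would
    // perform one replacement per match.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String three = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";

        // One supplementary character: two chars, one code point.
        System.out.println(codeUnits("\uD840\uDC02") + " chars, "
                + codePoints("\uD840\uDC02") + " code point(s)");

        // The three patterns from the examples above.
        System.out.println(countMatches("[AB\uD840\uDC01C]", three));
        System.out.println(countMatches("[^0-9]", "12\uD840\uDC02"));
        System.out.println(countMatches("[^\uD840\uDC00-\uD840\uDC01]", three));
    }
}
```

If the match count for a pattern equals the number of code points it covers, the matcher is advancing by code point; if it equals the number of chars, it is advancing by 16-bit unit.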
An empty String pattern is really a corner case here; it does not say
anything about "character".
So it should be specified in the javadoc, and I'm with Dawid to implement it as
in Python.
-Ulf