Thanks Sherman!

On 08.06.2012 20:36, Xueming Shen wrote:
On 06/08/2012 05:16 AM, Ulf Zibis wrote:


Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or Unicode code points?

The regex spec says Pattern and Matcher work on a character sequence, with
reference to the CharSequence interface, but the pattern itself does support
Unicode characters via various regex constructs and flags.
In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a sequence of Unicode code points, right?
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
==> "\uD840\uDC00?\uD840\uDC02"         // only 1 replacement for \uD840\uDC01
"12\uD840\uDC02".replaceAll("[^0-9]", "?")
==> "12??"          // 2 replacements for \uD840\uDC02
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
==> "\uD840\uDC00\uD840\uDC01?"          // only 1 replacement for \uD840\uDC02
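
To make the three cases above easy to check on a given JDK, here is a minimal sketch that runs them and reports char and code point counts for input and result. The class and method names (SurrogateRegexDemo, replace, describe) are made up for illustration; only String.replaceAll and String.codePointCount are real API. The results quoted above are what I expect, so treat what the program prints on your JDK as authoritative, not this mail.

```java
final class SurrogateRegexDemo {
    // U+20000, U+20001, U+20002: each one is a surrogate pair (2 chars, 1 code point).
    static final String THREE = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";

    // Hypothetical helper: replace every match of regex with a single '?'.
    static String replace(String input, String regex) {
        return input.replaceAll(regex, "?");
    }

    // Hypothetical helper: show both views of a string's length.
    static String describe(String s) {
        return s.length() + " chars, " + s.codePointCount(0, s.length()) + " code points";
    }

    public static void main(String[] args) {
        String[][] cases = {
            {THREE, "[AB\uD840\uDC01C]"},            // surrogate pair inside a class
            {"12\uD840\uDC02", "[^0-9]"},            // no surrogate pair in the pattern
            {THREE, "[^\uD840\uDC00-\uD840\uDC01]"}, // negated range of surrogate pairs
        };
        for (String[] c : cases) {
            String out = replace(c[0], c[1]);
            System.out.println(describe(c[0]) + "  ->  " + describe(out));
        }
    }
}
```

If the result loses exactly one code point's worth of chars per replacement, the matcher treated the pair as one unit; if two '?' appear for one pair, it matched the 16-bit chars individually.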


An empty String pattern is really a corner case here; it does
not say anything about "character".
So it should be specified in the javadoc, and I agree with Dawid that it
should be implemented as in Python.
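
For reference, a small sketch of the corner case: it inserts a marker at every empty match in a string that consists of one surrogate pair, and prints non-ASCII chars as hex so a split pair is visible. EmptyPatternDemo, markEmptyMatches and render are hypothetical names; I deliberately state no expected output, since whether a marker lands between the two surrogates is exactly the behaviour in question and may differ between JDK versions.

```java
final class EmptyPatternDemo {
    // Hypothetical helper: put '-' at every position where the empty pattern matches.
    static String markEmptyMatches(String input) {
        return input.replaceAll("", "-");
    }

    // Hypothetical helper: print non-ASCII chars as <HEX> so lone surrogates stay readable.
    static String render(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("<%04X>", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pair = "\uD840\uDC00"; // one code point, two chars
        // A '-' between <D840> and <DC00> means the empty match split the pair.
        System.out.println(render(markEmptyMatches(pair)));
    }
}
```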

-Ulf
