On 06/08/2012 12:07 PM, Ulf Zibis wrote:
Thanks Sherman!
On 08.06.2012 20:36, Xueming Shen wrote:
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec whether the Java Regex API has a general contract with 16-bit chars or Unicode
codepoints?
The regex spec says Pattern and Matcher work ON character sequence with the
refe
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec whether the Java Regex API has a general contract
with 16-bit chars or Unicode codepoints?
The regex spec says Pattern and Matcher work ON character sequence with
the reference to CharSequence interface, but the pattern itself does
If we could redesign everything (not the lib, but the language) all over
again from the very beginning, and if we all put an i18n engineer's hat
on :-) it might be natural to have a 32-bit char instead of the 16-bit one
(OK, it's a totally different story from a performance point of view),
and then we w
Oops, correction:
StringBuilder sb = new StringBuilder(s1.length() * 2 + 1);
for (char c : s1.toCharArray())  // iterates UTF-16 code units, not code points
    sb.append('X').append(c);
String s2 = sb.append('X').toString();
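For comparison, here is a minimal sketch of the same loop stepping by Unicode code point instead of by 16-bit char, so a surrogate pair is never split. The class name CodePointInterleave and the helper interleave are hypothetical, introduced only for illustration; the sketch does not claim to show what replaceAll does on any particular JDK.

```java
public class CodePointInterleave {
    // Hypothetical helper: puts 'X' before every code point and once at the end.
    static String interleave(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2 + 1);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads a whole surrogate pair if present
            sb.append('X').appendCodePoint(cp);
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return sb.append('X').toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String s2 = interleave(s1);
        // The surrogate pair stays adjacent in the result.
        System.out.println(s2.contains("\uD840\uDC00"));
    }
}
```

With the input from the original report, the result keeps the high and low surrogate next to each other, which is the behaviour the code-point reading of the API would imply.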
On 08.06.2012 14:16, Ulf Zibis wrote:
I tend to agree Dawid.
Especially the comparison with Python behaviour is demonstrative.
Is there any spec whether the Java Regex API has a general contract with 16-bit chars or Unicode
codepoints?
Thinking about the search pattern e.g. "[AB\uD840\uDC00C]"; what does it actually search for, th
I guess a lot depends on the point of view. From a historical point of
view (where a char[] and a String are basically unsigned values) that
pattern should simply process every value (index) and work like you
say. But from a practical point of view I think it is a bug -- it
corrupts the string, trans
Personally I don't think it is a bug. A j.l.String represents a sequence
of UTF-16 chars. While a pair of surrogates represents a supplementary
character, a single surrogate itself is still a "legal" independent entity
inside a String object, and the length of a String is still defined as
the total n
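The distinction drawn here can be checked directly with standard java.lang.String methods. The following is a small sketch; the class name SurrogateLength and its two helpers are hypothetical names used only for this illustration.

```java
public class SurrogateLength {
    // Number of 16-bit UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once,
    // and an unpaired surrogate also counts as one code point.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";   // contains one supplementary character
        String lone = "\uD840";          // a single high surrogate is still a legal String
        System.out.println(codeUnits(s1) + " code units, " + codePoints(s1) + " code points");
        System.out.println(codeUnits(lone) + " code unit, " + codePoints(lone) + " code point");
    }
}
```

For the string from the original report, length() counts five chars while codePointCount sees four code points, which is exactly the gap the thread is arguing about.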
Hi, I'm a committer to the Apache Lucene project. We have randomized
tests and one seed hit the following (simplified) scenario:
String s1 = "AB\uD840\uDC00C";
String s2 = s1.replaceAll("", "X");
The input contains a supplementary Unicode character (any surrogate pair
will do). The pattern is a