Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-08 Thread Xueming Shen
On 06/08/2012 12:07 PM, Ulf Zibis wrote: Thanks Sherman! On 08.06.2012 20:36, Xueming Shen wrote: On 06/08/2012 05:16 AM, Ulf Zibis wrote: Is there any spec whether the Java Regex API has a general contract with 16-bit chars or Unicode codepoints? The regex spec says Pattern and Matcher

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-08 Thread Ulf Zibis
Thanks Sherman! On 08.06.2012 20:36, Xueming Shen wrote: On 06/08/2012 05:16 AM, Ulf Zibis wrote: Is there any spec whether the Java Regex API has a general contract with 16-bit chars or Unicode codepoints? The regex spec says Pattern and Matcher work ON character sequence with the refe

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-08 Thread Xueming Shen
On 06/08/2012 05:16 AM, Ulf Zibis wrote: Is there any spec whether the Java Regex API has a general contract with 16-bit chars or Unicode codepoints? The regex spec says Pattern and Matcher work ON character sequence with the reference to CharSequence interface, but the pattern itself does

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-08 Thread Xueming Shen
If we could re-design everything (not the lib, but the language) all over again from the very beginning, and if we all put an i18n engineer's hat on :-) it might be natural to have a 32-bit char instead of the 16-bit (OK, it's a totally different story from a performance point of view), and then we w

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-08 Thread Ulf Zibis
Oops, correction: StringBuilder sb = new StringBuilder(s1.length() * 2 + 1); for (char c : s1.toCharArray()) sb.append('X').append(c); String s2 = sb.append('X').toString(); On 08.06.2012 14:16, Ulf Zibis wrote: I tend to agree Dawid. Especially the comparison with Python behaviour is demonstr

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-08 Thread Ulf Zibis
I tend to agree Dawid. Especially the comparison with Python behaviour is demonstrative. Is there any spec whether the Java Regex API has a general contract with 16-bit chars or Unicode codepoints? Thinking about the search pattern e.g. "[AB\uD840\uDC00C]"; what does it actually search for, th
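
A minimal sketch for probing the question about the character class "[AB\uD840\uDC00C]". It assumes the behaviour described in the java.util.regex.Pattern documentation, where a surrogate pair in the pattern is treated as one supplementary code point. The class name SurrogateClassDemo and the helper matchesWhole are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

public class SurrogateClassDemo {
    // The character class from the message: 'A', 'B', the pair \uD840\uDC00 (U+20000), 'C'.
    static final Pattern CLASS = Pattern.compile("[AB\uD840\uDC00C]");

    // Hypothetical helper: true if the whole input is matched by a single class member.
    static boolean matchesWhole(String input) {
        return CLASS.matcher(input).matches();
    }

    public static void main(String[] args) {
        // The complete surrogate pair is tested as one code point.
        System.out.println(matchesWhole("\uD840\uDC00"));
        // A lone high surrogate is tested as its own code point (U+D840).
        System.out.println(matchesWhole("\uD840"));
        // Plain BMP members of the class.
        System.out.println(matchesWhole("A"));
    }
}
```

If the assumption holds, the pair matches as a unit while an isolated high surrogate does not, which suggests the class is interpreted in terms of code points rather than 16-bit chars.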

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-08 Thread Dawid Weiss
I guess a lot depends on the point of view. From a historical point of view (where a char[] and a String are basically unsigned values) that pattern should simply process every value (index) and work like you say. But from a practical point of view I think it is a bug -- it corrupts the string, trans

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-07 Thread Xueming Shen
Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate itself is still a "legal" independent entity inside a String object and the length of a String is still defined as the total n
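
The distinction between UTF-16 chars and supplementary characters can be illustrated with a short sketch. It uses the string from the original report; the class name LengthDemo and its two helper methods are hypothetical names chosen for this example.

```java
public class LengthDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        System.out.println(charCount(s1));       // 5 chars: A, B, high surrogate, low surrogate, C
        System.out.println(codePointCount(s1));  // 4 code points: A, B, U+20000, C
        // Each half of the pair is individually addressable by index.
        System.out.println(Character.isHighSurrogate(s1.charAt(2)));
        System.out.println(Character.isLowSurrogate(s1.charAt(3)));
    }
}
```

Because every index from 0 to length() is a valid position in the String, an operation defined per char index can land between the two halves of a pair.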

Empty regexp replaceall and surrogate pairs results in corrupted utf16.

2012-06-07 Thread Dawid Weiss
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario: String s1 = "AB\uD840\uDC00C"; String s2 = s1.replaceAll("", "X"); the input contains an extended Unicode character (any surrogate pair will do). The pattern is a
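
A sketch contrasting two ways of inserting a marker at every position of the reported input: once per UTF-16 char index, which separates the two halves of a surrogate pair, and once per code point, which keeps the pair together. This is not the regex implementation itself; the class InterleaveDemo and both method names are hypothetical and exist only to show the difference.

```java
public class InterleaveDemo {
    // Inserts the marker at every char index, so it also lands between
    // the high and low surrogate of a pair.
    static String interleaveByChar(String s, char marker) {
        StringBuilder sb = new StringBuilder(s.length() * 2 + 1);
        for (char c : s.toCharArray()) {
            sb.append(marker).append(c);
        }
        return sb.append(marker).toString();
    }

    // Inserts the marker only at code point boundaries, so a valid
    // surrogate pair is copied as one unit.
    static String interleaveByCodePoint(String s, char marker) {
        StringBuilder sb = new StringBuilder(s.length() * 2 + 1);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(marker).appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return sb.append(marker).toString();
    }

    public static void main(String[] args) {
        String s1 = "AB\uD840\uDC00C";
        String byChar = interleaveByChar(s1, 'X');
        String byCodePoint = interleaveByCodePoint(s1, 'X');
        // Print lengths rather than raw text, since unpaired surrogates
        // may not render in a console.
        System.out.println(byChar.length());       // 11
        System.out.println(byCodePoint.length());  // 10
    }
}
```

The char-based result contains two unpaired surrogates and is therefore not well-formed UTF-16, while the code-point-based result preserves U+20000 intact.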