Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Xueming Shen Fri, 08 Jun 2012 11:06:37 -0700

If we can re-design everything (not the lib, but the language) alloveragain from thevery beginning , and if we all put an i18n engr's hat on:-) it might benature to havea 32-bit char instead of the 16-bit (OK, it's totally a difference storyif fromperformance point of view), and then we would not have this kind oftrouble at all.

The reality is that String is still UTF16 and char based, Unicodesupplementarycode point support is an add-on, so the default index based iteration ischar based.

You will have to use those codePointXYZ() to do code point based access.

If you consider the replaceAll() is to replace all targeted sub-string,and this sub-stringhappens to be an empty string in this case, as long asString.substring(i, i+1) worksfor the index "i" which is in between of two paired surrogates, it ishard to argue that

you should not replace that "sub-string" with the replacement.

Another "excuse" to back my "not a bug" argument is that as the API docsuggeststhe String.replaceAll() is simply a wrapper of j.u.regex.Matcher, bothPattern andMatcher class all explicitly documents that they work on "CharSequence",which ispure char based and "does not" know code point (for now, with the newdefendermethod, it might be possible to evolute the CharSequence, if desirable,but thatwould be a totally different discussion). When I say "work onCharSequence", I'm

not saying the Pattern does/should not work on code point, actually the

implementation "transfers" all the String based pattern into a codepoint basedpattern before interpreting it, but all code/point (mainly forsupplementary) supportworks on the assumption that you have a "constructor". An empty stringpattern

really does not have any indication on "character".

No, I don't think we have any concrete reference as to how an emptystring regexshould be interpreted. By its nature, it should be the match of everypossiblezero-width position in the target CharSequence, which is char based uponthe API.

It might be desired to document this somewhere in the Pattern class.

-Sherman



On 06/08/2012 04:14 AM, Dawid Weiss wrote:

I guess a lot depends on the point of view. From historical point of
view (where a char[] and a String are basically unsigned values) that
pattern should simply process every value (index) and work like you
say. But from a practical point of view I think it is a bug -- it
corrupts the string, transforming legal unicode into invalid values.

I checked with Python (3) and the behavior there is the expected one
(it work at the unicode codepoint level rather than surrogate level).

Where is the behavior of "" that you mention defined? I admit I
couldn't find any reference to this in the documentation:

Using an empty String "" as a regex for the replaceAll() takes the
advantage of the special meaning of "", in which it is interpreted as
it can match any possible zero-width position of the target String

I'm not saying you're wrong (and that pattern is definitely not common
so it's probably academic discussion) but I'd like some concrete
reference as to how an empty pattern should behave. To me consistency
with the rest of the Pattern specification would be that it operates
at "zero width position between unicode characters" not between any
char[] value, even an incorrect one or in the middle of a surrogate.

Dawid

On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen<[email protected]>  wrote:

Personally I don't think it is a bug. A j.l.String represents a sequence of
UTF-16 chars. While
a pair of surrogates represents a supplementary character, a single
surrogate itself is still
a "legal" independent entity inside a String object and length of a String
is still defined as
the total number of char unit and an index value between a high surrogate
and a low
surrogate is still a legal index value that can be used to access the char
at that particular
position. Using an empty String "" as a regex for the replaceAll() takes the
advantage of the
special meaning of "", in which it is interpreted as it can match any
possible zero-width
position of the target String, it does  not imply anything regarding
"character"  or
"characters" around it, so I would not interpret it as a zero-with character
boundary,
therefor a "position" in between a pair surrogates is still a good "found"
for replacing.

-Sherman


On 6/7/2012 1:07 PM, Dawid Weiss wrote:

Hi, I'm a committer to the Apache Lucene project. We have randomized
tests and one seed hit the following (simplified) scenario:

    String s1 = "AB\uD840\uDC00C";
    String s2 = s1.replaceAll("", "X");

the input contains an extended unicode character (any surrogate pair
will do). The pattern is an empty string (in fact, it was randomized
as "]|" but it's the same problem so I omit the details). The problem
is that after applying this pattern, replaceAll inserts X in between
the surrogate pair characters and this results in invalid UTF-16:

AB𠀀C
XAXBX?X?XCX

I believe this is a bug in the regexp implementation (sorry, don't
have a patch for it) but I'd like to confirm it's not something known.
Pointers appreciated.

Dawid

Re: Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Reply via email to