Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote on Mon, 15 Aug 2011 04:56:55 -0000:
> Another thing I noticed is that (at least on wide builds) surrogate
> pairs are not joined "on the fly":
>
>     >>> p
>     '\ud800\udc00'
>     >>> len(p)
>     2
>     >>> p.encode('utf-16').decode('utf-16')
>     '𐀀'
>     >>> len(_)
>     1

(For those who may not immediately realize from reading the surrogates,
'𐀀' is code point 0x10000, the first non-BMP code point.  I piped it
through `uniquote -x` just to make sure.)

Yes, that makes perfect sense.  It's something of a buggy feature or
featureful bug that UTF-16 does this.

When you are thinking of arbitrary sequences of code points, which is
something you have to be able to do in memory but not in a UTF stream,
then one can say that one has four code points of anything in the
0 .. 0x10FFFF range.  Those can be any arbitrary code points only
(1) *while* in memory, *and* (2) assuming a non-UTF-16 representation,
i.e. UTF-32 or UTF-8.  You cannot do that with UTF-16, which is why it
works only on a Python wide build.  Otherwise they join up.

The reason they "join up" in UTF-16 is also the reason why, unlike in
regular memory where you might be able to use an alternate
representation like UTF-8 or UTF-32, UTF streams cannot contain
unpaired surrogates: if that stream were in UTF-16, you would never be
able to tell the difference between a sequence of a lead surrogate
followed by a tail surrogate and the same thing meaning just one
non-BMP code point.  Since you would not be able to tell the
difference, it always means only the latter, and the former sense is
illegal.  This is why lone surrogates are illegal in UTF streams.

In case it isn't obvious, *this* is the source of the [𝒜--𝒵] bug in
all the UTF-16 and UCS-2 regex languages.  It is why Java 7 added
\x{...}, so that you can rewrite that as [\x{1D49C}--\x{1D4B5}] to pass
the regex compiler: it now sees something indirect, not raw surrogates.
That's why I always check it in my cross-language regex tests.  A
16-bit language has to have a workaround, somehow, or it will be in
trouble.
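(Aside: the join-up, and the pairing arithmetic behind it, are easy to
demonstrate from Python itself.  This is just a sketch assuming a modern
Python 3, where the 'surrogatepass' error handler is needed to get lone
surrogates past the strict encoder; the join_surrogates helper is mine,
not from any library, mirroring the code-unit-to-code-point pass that a
16-bit engine has to perform.)

```python
# Two lone surrogates: fine in memory as abstract code points.
p = '\ud800\udc00'
assert len(p) == 2

# A round trip through UTF-16 fuses them: once serialized, a lead/tail
# pair can only mean the single non-BMP code point U+10000.
q = p.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
assert q == '\U00010000' and len(q) == 1

def join_surrogates(units):
    """Join UTF-16 code units into code points (the 'join up' step)."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # lead + tail surrogate -> one supplementary code point
            out.append(0x10000 + ((u - 0xD800) << 10)
                               + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)   # BMP unit (or lone surrogate) passes through
            i += 1
    return out

assert join_surrogates([0xD800, 0xDC00]) == [0x10000]
assert join_surrogates([0xD835, 0xDC9C]) == [0x1D49C]   # 𝒜
```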
The Java regex compiler doesn't generate UTF-16 for itself, either.  It
generates UTF-32 for its pattern.  You can see this right at the start
of the source code.  This is from the Java Pattern class:

    /**
     * Copies regular expression to an int array and invokes the parsing
     * of the expression which will create the object tree.
     */
    private void compile() {
        // Handle canonical equivalences
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize();
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Copy pattern to int array for convenience
        // Use double zero to terminate pattern
        temp = new int[patternLength + 2];

        hasSupplementary = false;
        int c, count = 0;
        // Convert all chars into code points
        for (int x = 0; x < patternLength; x += Character.charCount(c)) {
            c = normalizedPattern.codePointAt(x);
            if (isSupplementary(c)) {
                hasSupplementary = true;
            }
            temp[count++] = c;
        }

        patternLength = count;   // patternLength now in code points

See how that works?  They use an int(-32) array, not a char(-16) array!
It's reasonably clever, and necessary.  Because it does that, it can
now compile \x{1D49C} or erstwhile embedded UTF-8 non-BMP literals into
UTF-32, and not get upset by the stormy sea of troubles that surrogates
are.  You can't have surrogates in ranges if you don't do something
like this in a 16-bit language.

Java couldn't fix the [𝒜--𝒵] bug except by doing the \x{...}
indirection trick, because they are stuck with UTF-16.  However, they
actually can match the string "𝒜" against the pattern "^.$", and have
it fail on "^..$".  Yes, I know: the code-unit length of that string is
2, but its regex count is just one dot's worth.  I *believe* they did
it that way because tr18 says it has to work that way, but they may
also have done it just because it makes sense.  My current contact at
Oracle doing regex support is not the guy who originally wrote the
class, so I am not sure.  (He's very good, BTW.
For Java 7, he also added named captures, script properties, *and*
brought the class up to conformance with tr18's "level 1"
requirements.)

I'm thinking Python might be able to do in the regex engine on narrow
builds the sort of thing that Java does.  However, I am also thinking
that might be a lot of work for a situation more readily addressed by
phasing out narrow builds, or at least by telling people they should
use wide builds if they want that sort of thing to work.

--tom

======================================================================

            ============================================
            ===>  QUASI OFF TOPIC ADDENDUM FOLLOWS  <===
            ============================================

The rest of this message is just a comparison demo of how Perl treats
the same sort of surrogate stuff that you were showing with Python.
I'm not sure that it is as relevant to your problem as the Java part
just was, since we don't have UTF-16 the way Java does.  One place it
*may* be relevant to Python, which is the only reason I'm including
it, is that it shows what sort of elastic boundaries you have with the
old loose form utf8 that you no longer have with the current strict
definition of UTF-8 from the Unicode Standard.

It turns out that you can't play those surrogate halvsies games in
Perl, because it thinks you're nuts. :)

(Whether this is a bug or a feature may depend on where you're
standing, but we're trying to conform with our reading of the
Standard.  Things like the Unicode Collation Algorithm have
third-party conformance suites, but I don't know of any for
encoders/decoders, so one is back to deciding whether one has
correctly understood the Standard.  As we all know, this isn't always
all that easy.)

This is the clincher at the far end, so you can quit reading now if
you want:

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

(Meanwhile, back to the quasi-off-topic monologue.)

Oh, you can have isolated surrogates in memory just fine.  We use
abstract code points, not code units, and since we aren't UTF-16,
there's no problem there:

    % perl -le 'print length("\x{D800}")'
    1
    % perl -le 'print length("\x{D800}\x{DC00}")'
    2
    % perl -le 'print length("\x{DC00}\x{D800}")'
    2

But you aren't allowed to *encode* those abstract characters into a
UTF encoding form.  (The -CS switch is like having PERLUNICODE=S or
PYTHONIOENCODING=utf8.)

    % perl -CS -le 'print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    ??????

Now, you might think my encoder did that, but it didn't.  Those
"??????" come from my dumb Mac Terminal, *not* from the Perl utf8
encoder.  This is yet another reason, BTW, why using "?" as the
replacement for a non-encodable code point is strongly discouraged by
Unicode: otherwise it is too hard to tell whether a "?" was in the
original data or is a replacement char.  Some of our encoders do use

    � FFFD REPLACEMENT CHARACTER
      * used to replace an incoming character whose value is unknown
        or unrepresentable in Unicode
      * compare the use of 001A as a control character to indicate the
        substitute function

But that isn't triggering here.
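(For comparison, Python draws the same loose-versus-strict line, but
with codec error handlers rather than with two differently named
codecs.  A sketch, assuming a modern Python 3:)

```python
s = '\ud800\udc00'

# Strict UTF-8 refuses lone surrogates outright.
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    pass
else:
    raise AssertionError('strict UTF-8 unexpectedly accepted surrogates')

# The loose form ('surrogatepass') emits them: byte-for-byte the same
# six bytes the loose utf8 encoding produces.
assert s.encode('utf-8', 'surrogatepass') == b'\xed\xa0\x80\xed\xb0\x80'

# The 'replace' handler substitutes '?' on the encode side -- exactly
# the ambiguity Unicode discourages, since a '?' in the output may or
# may not have been in the input.
assert s.encode('utf-8', 'replace') == b'??'

# Whole code points, of course, encode fine in any UTF.
cp = '\U00010000'
assert cp.encode('utf-16-be') == b'\xd8\x00\xdc\x00'
assert cp.encode('utf-8') == b'\xf0\x90\x80\x80'
```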
Running it through uniquote shows that you got illegal surrogates in
your (loose) utf8 stream:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \x{D800}\x{DC00}

Use -b on uniquote to get bytes, not hex chars:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \xED\xA0\x80\xED\xB0\x80

Yes, I'm not fond of the dribbled-out warning, either.  It should of
course be an exception.

Part of what is going on here is that Perl's regular "utf8" encoding
is (I believe) the same loose one that Python also uses, the old one
from before the restrictions.  If you use the strict "UTF-8", then you
get...

    % perl -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    "\x{d800}" does not map to utf8.
    "\x{dc00}" does not map to utf8.
    \x{D800}\x{DC00}

Which is pretty whacky.  I'm *not* uniquoting that this time around;
the encoder is.  It did the \x{} substitution because it is absolutely
forbidden from ever emitting surrogates into a valid strict UTF-8
stream, the way it will begrudgingly do into a loose utf8 stream
(although the result is then not valid as strict UTF-8).

If you run with all warnings promoted into exceptions (as I almost
always do), then you get more reasonable behavior, maybe.  I'm going
to use a command-line switch that's just like saying

    use warnings "FATAL" => "all";

in the top-level scope.  First with loose utf8:

    % perl -Mwarnings=FATAL,all -CS -le 'print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 255

And next with strict UTF-8:

    % perl -Mwarnings=FATAL,all -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 25

So when you make them fatal, you get an exception, which by not being
caught causes the program to exit.  (The "Exit" is printed by my
shell, and no, I'm sorry but I don't know why the status value
differs.  Well, kinda I know.  It's because errno was 25 (ENOTTY) in
the second case, but had been 0 in the first case, so that one turned
into an unsigned char -1, i.e. 255.)

If you *promise* Perl you know what you're doing and full speed ahead,
you can disable utf8 warnings, which will work to get you your two
surrogates in the outbound *loose* utf8 stream, as six bytes that look
like UTF-8 but actually "can't" be (called that) due to the Standard's
rules:

    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"'
    ??????
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    \x{D800}\x{DC00}
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    \xED\xA0\x80\xED\xB0\x80

(The weird module import is like saying

    no warnings "utf8";

in the outer scope.)  You can't get it with strict UTF-8 no matter how
hard you try, though.  Only the loose one will put out invalid UTF-8.

So what about UTF-16?  Can we build things up piecemeal?  Turns out
that no, we can't.  I *think* this is a feature, but I'm not 100.000%
sure.

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00
    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

You get all nulls because it refused to give you anything for the
surrogates.  Although you can force it for UTF-8 if you use the loose
utf8 version that is out of spec, there is just nothing you can do to
encode surrogate code points into UTF-16.  Just can't.  There is no
loose UTF-16.  Or that's what it looks like to me.

You can have surrogates in memory, but those cannot make it into a
UTF-16 stream.  Or a UTF-32 stream, either:

    % perl -MEncode -le 'print encode("UTF-32LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

You have to encode integral code points, not halvsies:

    % perl -C0 -MEncode -le 'print encode("UTF-16BE", "\x{10000}")' | uniquote -b
    \xD8\x00\xDC\x00
    % perl -C0 -MEncode -le 'print encode("UTF-16LE", "\x{10000}")' | uniquote -b
    \x00\xD8\x00\xDC
    % perl -C0 -MEncode -le 'print encode("UTF-16", "\x{10000}")' | uniquote -b
    \xFE\xFF\xD8\x00\xDC\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32BE", "\x{10000}")' | uniquote -b
    \x00\x01\x00\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32LE", "\x{10000}")' | uniquote -b
    \x00\x00\x01\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32", "\x{10000}")' | uniquote -b
    \x00\x00\xFE\xFF\x00\x01\x00\x00
    % perl -C0 -MEncode -le 'print encode("UTF-8", "\x{10000}")' | uniquote -b
    \xF0\x90\x80\x80
    % perl -C0 -MEncode -le 'print encode("UTF-8", "\x{10000}")' | uniquote -x
    \x{10000}
    % perl -C0 -MEncode -le 'print encode("UTF-8", "\x{10000}")'
    𐀀
    % perl -CS -le 'print "\x{10000}"'
    𐀀

tom++

----------
_______________________________________
Python tracker
<rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________