On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote: > Perhaps the bug is not UTF-8's inability to encode lone > surrogates, but that Python allows you to create lone surrogates in the > first place. That's not a rhetorical question. It's a genuine question.
As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that requires that all codepoints be valid (no surrogates, no U+FFFE, etc)? This is the kind of thing that's usually done in an obscure language before it hits a mainstream one. Pike is similar to Python here. I can create a string with invalid code points in it: > "\uFFFE\uDD00"; (1) Result: "\ufffe\udd00" but I can't UTF-8 encode that: > string_to_utf8("\uFFFE\uDD00"); Character 0x0000dd00 at index 1 is in the surrogate range and therefore invalid. Unknown program: string_to_utf8("\ufffe\udd00") HilfeInput:1: HilfeInput()->___HilfeWrapper() Or, using the streaming UTF-8 encoder instead of the short-hand: > Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain(); Error encoding "\ufffe"[0xdd00] using utf8: Unsupported character 56576. /usr/local/pike/8.1.0/lib/modules/_Charset.so:1: _Charset.UTF8enc()->feed("\ufffe\udd00") HilfeInput:1: HilfeInput()->___HilfeWrapper() Does anyone know of a language where you can't even construct the string? ChrisA -- https://mail.python.org/mailman/listinfo/python-list