Re: Newbie question about text encoding

Chris Angelico Sun, 08 Mar 2015 14:16:40 -0700

On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<[email protected]> wrote:
> Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.


As to the notion of rejecting the construction of strings containing
these invalid codepoints, I'm not sure. Are there any languages out
there that have a Unicode string type that requires that all
codepoints be valid (no surrogates, no U+FFFE, etc)? This is the kind
of thing that's usually done in an obscure language before it hits a
mainstream one.

Pike is similar to Python here. I can create a string with invalid
code points in it:

> "\uFFFE\uDD00";
(1) Result: "\ufffe\udd00"

but I can't UTF-8 encode that:

> string_to_utf8("\uFFFE\uDD00");
Character 0x0000dd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()

Or, using the streaming UTF-8 encoder instead of the short-hand:

> Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain();
Error encoding "\ufffe"[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1:
    _Charset.UTF8enc()->feed("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()
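
For comparison, here is a minimal sketch of the equivalent behaviour in Python 3. It assumes CPython's default "strict" error handler for the UTF-8 codec, and the variable names are only illustrative. As in Pike, the string can be built, and the encoder objects to the surrogate at index 1 rather than to U+FFFE:

```python
# Building a str that holds a noncharacter (U+FFFE) and a lone
# surrogate (U+DD00) is allowed.
s = "\ufffe\udd00"
length = len(s)  # two code points

# Encoding with the default strict handler fails on the surrogate.
failed_at = None
reason = None
try:
    s.encode("utf-8")
except UnicodeEncodeError as exc:
    failed_at = exc.start   # index of the offending code point
    reason = exc.reason     # human-readable explanation from the codec
    print(failed_at, reason)

# U+FFFE on its own is accepted by the encoder; only the surrogate is rejected.
noncharacter_bytes = "\ufffe".encode("utf-8")
print(noncharacter_bytes)
```

So in both languages the check happens at encode time, not when the string is created.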

Does anyone know of a language where you can't even construct the string?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
