Re: Newbie question about text encoding

Marko Rauhamaa Sun, 08 Mar 2015 00:11:58 -0800

Chris Angelico <[email protected]>:

> Once again, you appear to be surprised that invalid data is failing.
> Why is this so strange? U+DD00 is not a valid character. It is quite
> correct to throw this error.


'\udd00' is a valid str object:

   >>> '\udd00'
   '\udd00'
   >>> '\udd00'.encode('utf-32')
   b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
   >>> '\udd00'.encode('utf-16')
   b'\xff\xfe\x00\xdd'

I was simply stating that UTF-8 is not a bijection between unicode
strings and octet strings (even forgetting Python). Enriching Unicode
with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
without side effects.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Reply via email to