Re: Glyphs and graphemes [was Re: Cult-like behaviour]

Marko Rauhamaa Tue, 17 Jul 2018 02:44:07 -0700

Chris Angelico <[email protected]>:

> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa <[email protected]> wrote:
>> Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
>> one big advantage: it can usually deal with non-Unicode data without any
>> extra considerations while Python3's strings make you have to take
>> elaborate measures to handle those special cases. Why, even print() must
>> be guarded against UnicodeEncodeError when the printed string is not in
>> the programmer's control.
>
> What is this "non-Unicode data" that UTF-8 can handle? Do you mean
> arbitrary byte sequences? Because no, it cannot; properly-formed UTF-8
> sequences MUST comply with the precise requirements of the format.


I was being imprecise. What I meant is that byte strings carrying UTF-8
handle malformed UTF-8 just as easily as well-formed UTF-8. And that's a
real, practical advantage.
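
To make the point concrete, here is a minimal sketch. The payload is a
made-up example (valid UTF-8 followed by a stray 0xFF byte), not something
from the thread. It shows bytes carrying the bad sequence untouched, strict
decoding to str rejecting it, and the surrogateescape error handler as one
of the extra measures needed to get it through a str round trip.

```python
# Hypothetical payload: valid UTF-8 ("café") followed by a stray 0xFF byte.
raw = "caf\u00e9".encode("utf-8") + b"\xff"

# As bytes, the malformed tail is carried along without complaint.
copied = raw + b"\n"

# Strict decoding to str rejects the same data.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the undecodable byte to a lone surrogate (U+DCFF)
# and restores the original byte when encoding with the same handler.
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")
```

Note that the resulting str contains a lone surrogate, so it is exactly the
kind of string that a plain print() may later refuse to encode.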

> Can you give an example of how Python 3's print function can raise
> UnicodeEncodeError when given a Python 3 string?

   >>> print("\ud810")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\ud810' \
   in position 0: surrogates not allowed
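
One way to guard print() against this, as a rough sketch only: the helper
name safe_print is hypothetical, and falling back to backslash escapes is
my assumption about what a reasonable guard looks like, not an established
idiom from the thread.

```python
import sys


def safe_print(text, file=None):
    """Print text, escaping anything the stream's codec cannot encode."""
    stream = sys.stdout if file is None else file
    try:
        print(text, file=stream)
    except UnicodeEncodeError:
        # Assumption: fall back to UTF-8 if the stream reports no encoding.
        encoding = getattr(stream, "encoding", None) or "utf-8"
        # Re-encode with backslashreplace so unencodable code points
        # (lone surrogates included) become visible escape sequences.
        escaped = text.encode(encoding, errors="backslashreplace")
        print(escaped.decode(encoding), file=stream)


safe_print("\ud810")  # the lone surrogate from the session above
```

On Python 3.7 and later, sys.stdout.reconfigure(errors="backslashreplace")
should achieve a similar effect for the whole stream without a wrapper.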


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
