Chris Angelico <ros...@gmail.com>:

> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
>> one big advantage: it can usually deal with non-Unicode data without any
>> extra considerations while Python3's strings make you have to take
>> elaborate measures to handle those special cases. Why, even print() must
>> be guarded against UnicodeEncodeError when the printed string is not in
>> the programmer's control.
>
> What is this "non-Unicode data" that UTF-8 can handle? Do you mean
> arbitrary byte sequences? Because no, it cannot; properly-formed UTF-8
> sequences MUST comply with the precise requirements of the format.

I was being imprecise: byte strings carrying UTF-8 can handle bad UTF-8
with equal ease. And that's a real, practical advantage.

> Can you give an example of how Python 3's print function can raise
> UnicodeEncodeError when given a Python 3 string?

   >>> print("\ud810")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\ud810' \
       in position 0: surrogates not allowed


Marko
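
To make both points concrete, here is a minimal sketch, not a recommendation of any one approach. It assumes a made-up byte string containing the invalid byte 0xff. It shows that a bytes object carries malformed UTF-8 untouched, that strict decoding rejects it, and that the standard "surrogateescape" error handler round-trips it through a str. The safe_print() helper is hypothetical (it is not part of Python); it illustrates one way a print() call might be guarded when the string is outside the programmer's control.

```python
import io
import sys

# Illustrative data: 0xff can never occur in well-formed UTF-8.
raw = b"caf\xc3\xa9 \xff tail"

# A bytes object does not care; ordinary operations just work.
parts = raw.split(b" ")

# Strict decoding refuses the malformed input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xff becomes U+DCFF) and restores it on encoding.
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")


def safe_print(s, stream=None):
    """Hypothetical helper: print s, escaping what the stream cannot encode."""
    if stream is None:
        stream = sys.stdout
    try:
        print(s, file=stream)
    except UnicodeEncodeError:
        encoding = getattr(stream, "encoding", None) or "utf-8"
        # backslashreplace turns unencodable characters into \uXXXX escapes.
        escaped = s.encode(encoding, errors="backslashreplace").decode(encoding)
        print(escaped, file=stream)
```

With a strict UTF-8 stream, safe_print("\ud810") writes the six characters \ud810 instead of raising. The price is that the output no longer distinguishes a real lone surrogate from a literal backslash escape, which is the kind of special-case handling the bytes version never needs.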