On Wed, Mar 11, 2020 at 9:05 PM Steven D'Aprano <st...@pearwood.info> wrote:
> In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice
> properties:
>
> * Every ASCII character encodes to a single byte, so text which
>   only contains ASCII values encodes to precisely the same set
>   of bytes under UTF-8 as under ASCII.
>
> * No Unicode character, except for the Unicode NUL '\0', encodes
>   to a sequence containing a null byte.
>
> These properties are not an accident -- they were carefully designed
> that way.

The second of those is actually part of an even stronger guarantee: No
Unicode character except for an ASCII character encodes to a sequence
containing a byte less than 128. In other words, the ASCII characters
U+0000 to U+007F perfectly correspond to the byte values 0x00 to 0x7F,
and *no other UTF-8 sequence* will ever contain one of those byte
values.
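
As a rough check of that guarantee, here is a small Python 3 sketch. The
helper name only_high_bytes is made up for this illustration and is not
part of any library.

```python
def only_high_bytes(ch):
    """True if every byte of ch's UTF-8 encoding is 0x80 or above."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))

# ASCII: each character is exactly the one byte with the same value.
ascii_ok = all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80))

# Non-ASCII: check every code point, skipping the surrogates
# U+D800..U+DFFF, which cannot be encoded as UTF-8 at all.
non_ascii_ok = all(
    only_high_bytes(chr(cp))
    for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF
)

print(ascii_ok, non_ascii_ok)  # True True
```

The second check walks about a million code points, so it can take a
moment to run.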

This makes parsing an ASCII-only file format easy. You don't have to
worry about, for instance, finding a byte value 0x3C unless it
represents "<". (Though if you're taking a more generic boundary like
"whitespace", you'll need to cope with more than just single bytes,
since Unicode has non-ASCII whitespace characters. But for something
like HTML, this is safe.)
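
To make that concrete, here is a minimal sketch that scans raw UTF-8
bytes for 0x3C without decoding them first. The function name lt_offsets
and the sample string are invented for this example.

```python
LT = 0x3C  # the byte for "<"

def lt_offsets(data):
    """Byte offsets of every 0x3C in data, found without decoding."""
    return [i for i, b in enumerate(data) if b == LT]

text = "naïve <b>café</b> 中文 < 😀"
data = text.encode("utf-8")

for off in lt_offsets(data):
    # Everything before a hit decodes cleanly, so the hit sits on a
    # character boundary, and the character there really is "<".
    prefix = data[:off].decode("utf-8")
    assert text[len(prefix)] == "<"

# Whitespace is different: some whitespace characters are multi-byte.
print("\u00a0".encode("utf-8"))  # NO-BREAK SPACE    -> b'\xc2\xa0'
print("\u3000".encode("utf-8"))  # IDEOGRAPHIC SPACE -> b'\xe3\x80\x80'
```

A byte-level scan for "<" never lands in the middle of a multi-byte
character, but a byte-level scan for "whitespace" would miss the last
two characters unless it knows their multi-byte sequences.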

Many other ASCII-compatible encodings make the same guarantee, although a
lot of them do this by having only 128 non-ASCII characters available.
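
Taking ISO-8859-1 (Latin-1) as one example of such an encoding (the
choice of example is mine), a short sketch of the trade-off in Python 3:

```python
# Decode each high byte under Latin-1: exactly 128 non-ASCII characters.
high_chars = [bytes([b]).decode("latin-1") for b in range(0x80, 0x100)]

# ASCII bytes decode to the same ASCII characters ...
ascii_same = all(bytes([b]).decode("latin-1") == chr(b) for b in range(0x80))

# ... and no non-ASCII character encodes to a byte below 0x80.
high_only = all(c.encode("latin-1")[0] >= 0x80 for c in high_chars)

# The price: anything outside that small repertoire cannot be encoded.
try:
    "€".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(len(high_chars), ascii_same, high_only, euro_fits)
```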

ChrisA
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HPI3WQNAPGOKFS4S47NG27TCP6ARDSTL/
Code of Conduct: http://python.org/psf/codeofconduct/