[Python-ideas] Re: prefix/suffix for bytes (was: New explicit methods to trim strings)

Chris Angelico Wed, 11 Mar 2020 03:32:24 -0700

On Wed, Mar 11, 2020 at 9:05 PM Steven D'Aprano <[email protected]> wrote:
> In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice
> properties:
>
> * Every ASCII character encodes to a single byte, so text which
>   only contains ASCII values encodes to precisely the same set
>   of bytes under UTF-8 as under ASCII.
>
> * No Unicode character, except for the Unicode NUL '\0', encodes
>   to a sequence containing a null byte.
>
> These properties are not an accident -- they were carefully designed
> that way.


The second of those is actually part of an even stronger guarantee: No
Unicode character except for an ASCII character encodes to a sequence
containing a byte less than 128. In other words, the ASCII characters
U+0000 to U+007F perfectly correspond to the byte values 0x00 to 0x7F,
and *no other UTF-8 sequence* will ever contain one of those byte
values.
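
A quick way to convince yourself of this is to brute-force it. The sketch below is only an illustration (the helper name non_ascii_bytes_are_high is made up for this example): it encodes every non-surrogate code point from U+0080 upward and checks that none of the resulting bytes falls below 0x80.

```python
def non_ascii_bytes_are_high(start=0x80, stop=0x110000):
    """Return True if every non-ASCII code point in [start, stop)
    encodes to UTF-8 bytes that are all >= 0x80.

    Hypothetical helper, written only for this illustration.
    """
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True


# ASCII characters encode to the identical single byte value.
ascii_is_identity = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)

# Everything else stays entirely in the 0x80..0xFF byte range.
high_only = non_ascii_bytes_are_high()
```

Both ascii_is_identity and high_only should come out True on any conforming UTF-8 codec.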

This makes parsing an ASCII-only file format easy. You don't have to
worry about, for instance, finding a byte value 0x3C unless it
represents "<". (Though if you're taking a more generic boundary like
"whitespace", you'll need to cope with more than just single bytes,
since Unicode has non-ASCII whitespace characters too. But for
something like HTML, this is safe.)
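
As a rough sketch of what that means in practice (the sample string and the find_byte helper are invented for this example), you can scan the raw UTF-8 bytes for 0x3C and every hit lines up with a real "<" in the text. The whitespace caveat shows up too: bytes.split() only knows about ASCII whitespace, while str.split() also splits on something like U+2003 EM SPACE.

```python
def find_byte(data, value):
    """Return the offsets of every byte equal to `value` (hypothetical helper)."""
    return [i for i, b in enumerate(data) if b == value]


text = "na\u00efve <b>\u65e5\u672c\u8a9e</b> caf\u00e9"
data = text.encode("utf-8")

# Every 0x3C byte is a genuine "<"; it is never part of a multi-byte sequence.
byte_hits = find_byte(data, 0x3C)
char_hits = [i for i, c in enumerate(text) if c == "<"]
same_count = len(byte_hits) == len(char_hits)

# Cutting at a hit never splits a character, so each prefix decodes cleanly.
prefixes = [data[:pos].decode("utf-8") for pos in byte_hits]

# Whitespace is different: EM SPACE is whitespace for str but not for bytes.
spaced = "a\u2003b"
str_parts = spaced.split()
bytes_parts = spaced.encode("utf-8").split()
```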

Other ASCII-compatible encodings make the same guarantees, although a
lot of them do this by having only 128 non-ASCII characters available.
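
To illustrate with one such encoding (ISO-8859-1, a.k.a. Latin-1, is my pick for the example, not something named above): it is a single-byte encoding, so bytes 0x00 to 0x7F are ASCII and the remaining 128 byte values are all it has for non-ASCII characters.

```python
# Bytes 0x00..0x7F decode identically under ASCII and Latin-1.
low = bytes(range(0x80))
ascii_compatible = low.decode("latin-1") == low.decode("ascii")

# The other 128 byte values each map to exactly one non-ASCII character.
high_chars = bytes(range(0x80, 0x100)).decode("latin-1")
non_ascii_count = len(high_chars)
all_non_ascii = all(ord(c) >= 0x80 for c in high_chars)

# Anything outside that repertoire simply cannot be encoded.
try:
    "\u65e5".encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```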

ChrisA
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/HPI3WQNAPGOKFS4S47NG27TCP6ARDSTL/
Code of Conduct: http://python.org/psf/codeofconduct/
