[issue36789] Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes

2019-05-08 Thread mbiggs


mbiggs  added the comment:

Ah, I sent a pull request but didn't realize that redshiftzero already had.  Their 
PR looks good to me.

Thanks for fixing this!

--

___
Python tracker 
<https://bugs.python.org/issue36789>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue36789] Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes

2019-05-08 Thread mbiggs


Change by mbiggs:


--
pull_requests: +13102




[issue36789] Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes

2019-05-04 Thread mbiggs


mbiggs  added the comment:

So a correct statement would be "A UTF-8 string is turned into a sequence of 
bytes that contains embedded zero bytes only where they represent the NULL 
character (U+0000)."

I think it's important to correct this because the part about processing UTF-8 
with C functions like strcpy() was wrong and could cause bugs.
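
To illustrate the kind of bug meant here, a minimal sketch using Python's ctypes module as a stand-in for C string handling. The sample string and variable names are made up for the example. The `.value` attribute of a ctypes char buffer reads up to the first zero byte, as strcpy() and strlen() would, while `.raw` returns the whole buffer.

```python
import ctypes

# Hypothetical sample: valid UTF-8 containing U+0000 in the middle.
data = "ab\x00cd".encode("utf-8")

# create_string_buffer copies the bytes and appends a terminating zero byte.
buf = ctypes.create_string_buffer(data)

# .value stops at the first zero byte, like a C string function would.
c_view = buf.value

# .raw returns every byte in the buffer, including the added terminator.
full = buf.raw

print(len(data), c_view, full)
```

Anything after the embedded zero byte is silently dropped in the C-style view, even though the data is valid UTF-8.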

--




[issue36789] Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes

2019-05-03 Thread mbiggs

New submission from mbiggs:

In the Unicode HOWTO: http://docs.python.org/3.3/howto/unicode.html

It says the following:


"UTF-8 has several convenient properties:
(...)
2. A Unicode string is turned into a sequence of bytes containing no embedded 
zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be 
processed by C functions such as strcpy() and sent through protocols that can’t 
handle zero bytes."

This is not right.  UTF-8 uses the zero byte to represent the Unicode code point 
U+0000 (the ASCII NULL character).  This is a valid character in UTF-8 and is 
handled just fine by Python's UTF-8 string encoding/decoding.
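
This can be checked with a short sketch in Python 3. The helper name `code_points_with_zero_byte` and the sample string are made up for the example. It round-trips a string containing U+0000 through UTF-8, then scans every encodable code point to see which ones produce a zero byte.

```python
# Hypothetical sample string with U+0000 in the middle.
text = "ab\x00cd"
encoded = text.encode("utf-8")
roundtrip = encoded.decode("utf-8")


def code_points_with_zero_byte(limit=0x110000):
    """Return the code points below `limit` whose UTF-8 encoding contains a zero byte."""
    found = []
    for cp in range(limit):
        # Surrogates (U+D800..U+DFFF) cannot be encoded as UTF-8, so skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        if 0 in chr(cp).encode("utf-8"):
            found.append(cp)
    return found


print(encoded)
print(roundtrip == text)
print(code_points_with_zero_byte())
```

The scan should report U+0000 as the only code point whose encoding contains a zero byte, which matches the claim that zero bytes appear in UTF-8 only where the text itself contains the NULL character.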

--
assignee: docs@python
components: Documentation
messages: 341363
nosy: docs@python, mbiggs
priority: normal
severity: normal
status: open
title: Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes
versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8
