Graham Wideman added the comment:

> Do you want to provide a patch?

I would be happy to, but I'm not currently set up to create a patch. Also, I 
was hoping that an author with more history with this article would supervise, 
especially since I don't know what the original intent was in places.

> I find use of the word "narrative" intimidating in the context of a technical 
> documentation.

Agreed. How about "In documentation such as the current article..."?

> In general, I find it disappointing that the Unicode HOWTO only gives 
> hexadecimal representations of non-ASCII characters and (almost) never 
> represents them in their true form. This makes things more abstract 
> than necessary.

I concur with reducing unnecessary abstraction. Not sure what you mean by 
"true form". Do you mean show the glyph which the code point represents? Or 
the sequence of bytes? Or display the code point value in decimal?

> > This is a vague claim. Probably what was intended was: "Many 
> > Internet standards define protocols in which the data must 
> > contain no zero bytes, or zero bytes have special meaning."  
> > Is this actually true? Are there "many" such standards?

> I think it actually means that Internet protocols assume an ASCII-compatible 
> encoding (which UTF-8 is, but not UTF-16 or UTF-32 - nor EBCDIC :-)).

Ah -- yes that makes sense.
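
For instance (just a quick interactive check of my own, not wording from the 
HOWTO), encoding plain ASCII text shows why UTF-8 suits such protocols while 
UTF-16 and UTF-32 do not:

    >>> "GET /".encode("utf-8")        # no zero bytes; ASCII-compatible
    b'GET /'
    >>> "GET /".encode("utf-16-le")    # zero bytes appear
    b'G\x00E\x00T\x00 \x00/\x00'
    >>> "GET /".encode("utf-32-le")
    b'G\x00\x00\x00E\x00\x00\x00T\x00\x00\x00 \x00\x00\x00/\x00\x00\x00'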

> > --> "Non-Unicode code systems usually don't handle all of 
> > the characters to be found in Unicode."

> The term *encoding* is used pervasively when dealing with the transformation 
> of unicode to/from bytes, so I find it confusing to introduce another term 
> here ("code systems"). I prefer the original sentence.

I see that my revision missed the target. There is a problem, but it is wider 
than this sentence.

One of the most essential points this article should make clear is the 
distinction between older schemes with a single mapping:

Characters <--> numbers in a particular binary format (e.g., ASCII)

... versus Unicode with two levels of mapping...

Characters <--> code point numbers <--> a particular binary format of the 
number data and sequences thereof.

In the older schemes, "encoding" referred to the one mapping: chars <--> 
numbers in particular binary format. In Unicode, "encoding" refers only to the 
mapping: code point numbers <--> binary format. It does not refer to the chars 
<--> code point mapping. (At least, I think that's the case. Regardless, the 
two mappings need to be rigorously distinguished.)
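
A tiny interactive sketch of the two levels (my illustration, not text from 
the article):

    >>> s = 'é'
    >>> hex(ord(s))          # character <--> code point (U+00E9)
    '0xe9'
    >>> s.encode('utf-8')    # code point <--> bytes: the "encoding" proper
    b'\xc3\xa9'
    >>> s.encode('latin-1')  # same code point, different byte representation
    b'\xe9'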

On review, there are many points in the article that muddy this up. For 
example, "Unicode started out using 16-bit characters instead of 8-bit 
characters". Saying "so-and-so-bit characters" about Unicode, in the current 
article, is either wrong or very confusing. Unicode characters are associated 
with code points, NOT with any _particular_ bit-level representation.
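
To make that concrete (again my own interactive sketch, not taken from the 
article): each character below is a single code point, and only the chosen 
encoding decides how many bytes represent it:

    >>> for ch in 'Aé\N{EURO SIGN}':
    ...     print(hex(ord(ch)),
    ...           len(ch.encode('utf-8')),
    ...           len(ch.encode('utf-16-le')),
    ...           len(ch.encode('utf-32-le')))
    ...
    0x41 1 2 4
    0xe9 2 2 4
    0x20ac 3 2 4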

If I'm right about the preceding, then it would be good for that to be spelled 
out more explicitly, and used consistently throughout the article. (I won't try 
to list all the examples of this problem here -- too messy.)

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20906>
_______________________________________