Re: [Python-Dev] Internal representation of strings and Micropython

Chris Angelico Wed, 04 Jun 2014 03:53:03 -0700

On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <pmis...@gmail.com> wrote:
> That's another reason why people don't like Unicode enforced upon them
> - all the talk about supporting all languages and scripts is demagogy
> and hypocrisy, given a choice, Unicode zealots would rather limit
> people to Latin script then give up on their arbitrarily chosen,
> one-among-thousands,
> soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.


Wrong. I use and recommend Unicode, with UTF-8 for transmission, and I
do not ever want to limit people to Latin-1 or any other such subset.
Even though English is the only language I speak, I am *frequently*
using non-ASCII characters (eg when I discuss mathematics on a MUD),
and if I could be absolutely sure that everyone in the conversation
correctly comprehended Unicode, I could do this with a lot more
confidence. Unfortunately, the server I use just passes bytes in and
out, and some clients assume CP-1252, others assume Latin-1, and
others (including my Gypsum) try UTF-8 first and fall back on an
eight-bit encoding (currently CP-1252 because of the first group). But
in an ideal world, server and clients would all speak Unicode
everywhere, and transmit and receive UTF-8. This is not hypocrisy,
this is the way to work reliably.

> Once again, my claim is what MicroPython implements now is more correct
> - in a sense wider than technical - handling. We don't provide Unicode
> encoding support, because it's highly bloated, but let people use any
> encoding they like. That comes at some price, like length of strings in
> characters are not know to runtime, only in bytes, but quite a lot of
> applications can be written by having just that.

The current implementation is flat-out lying, actually. It claims that
it's storing Unicode codepoints (as per the Python spec) while
actually storing bytes, and then it transmits those bytes to the
console etc as-is. This is a bug. It needs to be fixed. The only
question is, what form will the fix take? Will it be PEP 393's
flexible fixed-width representation? UTF-8? UTF-16 (I hope not!)? A
hybrid of Latin-1 where possible and UTF-8 otherwise? But something
has to be done.

ChrisA
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Internal representation of strings and Micropython

Reply via email to