On 05Feb2017 22:27, boB Stepp <robertvst...@gmail.com> wrote:
On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano <st...@pearwood.info> wrote:
Alternatively, you can embed it right in the string. For code points
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
escapes:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'


Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'

You have surprised me here by using single quotes to enclose the
entire assignment statements.  I thought this would throw a syntax
error, but it works just like you show.  What is going on here?

It's not an assignment statement, it's just a string. Steven is typing a string literal containing a \N{...} escape sequence, and the interactive interpreter prints that string back at you; π is a printable character, so it gets displayed directly.

Try with this:

 py> 'here is a string\n\nline 3'
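
The interactive prompt echoes back the repr() of whatever expression you type, so control characters like \n show up as backslash escapes, while print() writes the real characters. Roughly (untested here, but this is standard Python 3 behaviour):

 py> s = 'here is a string\n\nline 3'
 py> s
 'here is a string\n\nline 3'
 py> print(s)
 here is a string

 line 3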

One last comment: Random832 said:
"Python 3 strings are unicode-unicode, not UTF-8."

If I recall what I originally wrote (and intended), I was merely
indicating that I was happy with Python 3's default UTF-8 encoding.  I do
not know enough to know what these other UTF encodings offer.

From the outside (i.e. to your code), Python 3 strings are sequences of Unicode code points (characters, near enough). How they're _stored_ internally is not your problem :-) When you write a string to a file or the terminal etc., the string needs to be _encoded_ into a sequence of bytes (a sequence rather than one byte per character, because there are far more Unicode code points than fit in a single byte).
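
To make that encode step concrete (a small sketch; the names are just for the example, but the byte values are standard UTF-8):

 py> s = 'pi = \u03C0'
 py> s.encode('utf-8')       # 6 code points become 7 bytes; π takes two
 b'pi = \xcf\x80'
 py> len(s), len(s.encode('utf-8'))
 (6, 7)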

UTF-8 is by far the commonest such encoding in use. It has several nice characteristics: for one, the ASCII code points _are_ stored in a single byte. While that's nice for Western, almost-only-English-speaking folks like me, it also means that the zillions of existing ASCII text files don't need to be recoded to work in UTF-8. It has other cool features too.
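
You can see that ASCII compatibility directly: a pure-ASCII string encodes to the same bytes whether you ask for ASCII or UTF-8, which is why those existing files are already valid UTF-8 as they stand:

 py> 'hello'.encode('ascii')
 b'hello'
 py> 'hello'.encode('ascii') == 'hello'.encode('utf-8')
 True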

To be pedantic, Unicode strings are sequences of abstract code points
("characters"). UTF-8 is one particular concrete encoding form that is
used to store or transmit such strings. Here are examples of three
possible encoding forms for the string 'πz':

UTF-16: either two or four bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A
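
If you want to check those byte sequences yourself, Python's encode() method will show them (bytes.hex() needs Python 3.5 or later; the "-be" spellings force big endian byte order so no byte order mark is written, which matters further down):

 py> s = 'πz'
 py> s.encode('utf-16-be').hex()
 '03c0007a'
 py> s.encode('utf-32-be').hex()
 '000003c00000007a'
 py> s.encode('utf-8').hex()
 'cf807a'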

I have not tallied up how many code points are actually assigned to
characters.  Does UTF-8 encoding currently cover all of them?  If yes,
why is there a need for other encodings?  Or by saying:

Yes, UTF-8 can represent every Unicode code point; the other encodings exist because of different tradeoffs. UTF-8 is variable length. You can leap into the middle of a UTF-8 string and resync (== find the first byte of the next character) thanks to its neat coding design, but you can't "seek" directly to the position of an arbitrarily numbered character (e.g. go to character 102345). By contrast, UTF-32 is fixed length.
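
A rough way to see the difference: in UTF-32 the byte offset of code point number n is just 4*n, so you can fetch a character by arithmetic alone, whereas in UTF-8 the per-character byte lengths vary and you must scan from the start (or keep an index). A sketch:

 py> s = 'πππhello'
 py> [len(c.encode('utf-8')) for c in s]        # UTF-8 byte length of each character
 [2, 2, 2, 1, 1, 1, 1, 1]
 py> raw32 = s.encode('utf-32-be')
 py> raw32[4 * 3 : 4 * 4].decode('utf-32-be')   # character number 3, found by arithmetic alone
 'h'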

(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
reversed, e.g. C003 7A00. UTF-8 is not.)

do you mean that some hardware configurations require UTF-16 or UTF-32?

No, different machines order the bytes within a larger word differently. "Big endian" machines like SPARC and M68k put the most significant byte first; little endian machines (e.g. Intel architecture machines) put the least significant byte first. (Aside: the Alpha was switchable.)

So that "natural" way to write UTF-16 or UTF-32 might be big or little endian, and you need to know what was chosen for a given file.

Cheers,
Cameron Simpson <c...@zip.com.au>