On 05Feb2017 22:27, boB Stepp <robertvst...@gmail.com> wrote:
On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano <st...@pearwood.info> wrote:
Alternatively, you can embed it right in the string. For code points
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'

Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'
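
For what it's worth, here is a small sketch checking that the three escape styles produce the same one-character string. The variable names are just mine for illustration; unicodedata is the standard library module for looking up code point names.

```python
import unicodedata

# Three spellings of the same single code point, U+03C0.
short_escape = '\u03C0'                      # \u takes exactly four hex digits
long_escape = '\U000003C0'                   # \U takes exactly eight hex digits
named_escape = '\N{GREEK SMALL LETTER PI}'   # \N takes the Unicode name

print(short_escape == long_escape == named_escape)  # True
print(len(named_escape))                            # 1
print(unicodedata.name(short_escape))               # GREEK SMALL LETTER PI
print(hex(ord(short_escape)))                       # 0x3c0
```

Going the other way, unicodedata.name() is handy when you have a character and want to know what to put inside \N{...}.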

You have surprised me here by using single quotes to enclose the
entire assignment statements.  I thought this would throw a syntax
error, but it works just like you show.  What is going on here?

It's not an assignment statement. It's just a string. He's typing a string containing a \N{...} sequence and Python's printing that string back at you; pi's a printable character and gets displayed directly.

Try with this:

 py> 'here is a string\n\nline 3'
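
Roughly speaking, the interactive prompt shows you the repr() of whatever expression you typed, whereas print() writes the string itself. A minimal sketch of the difference (the names text and echoed are made up for this example):

```python
text = 'here is a string\n\nline 3'

# What the interactive prompt echoes: quotes included, newlines as escapes.
echoed = repr(text)
print(echoed)   # 'here is a string\n\nline 3'

# What print() writes: three lines, the middle one empty.
print(text)

# A printable non-ASCII character such as pi is shown as itself by repr().
print(repr('pi = \u03C0'))   # 'pi = π'
```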

One last comment: Random832 said:
"Python 3 strings are unicode-unicode, not UTF-8."

If I recall what I originally wrote (and intended) I was merely
indicating I was happy with Python 3's default UTF-8 encoding.  I do
not know enough to know what these other UTF encodings offer.

From the outside (i.e. to your code) Python 3 strings are sequences of Unicode code points (characters, near enough). How they're _stored_ internally is not your problem:-) When you write a string to a file or the terminal etc, the string needs to be _encoded_ into a sequence of bytes (a sequence of bytes because there are more Unicode code points than can be expressed with one byte).
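
As a small illustration of the string versus bytes distinction, here is a sketch that encodes a string with str.encode() and decodes it again; the variable names are only for the example.

```python
s = 'pi = \u03C0'

# Encoding turns a sequence of code points into a sequence of bytes.
data = s.encode('utf-8')

print(len(s))      # 6 (code points)
print(len(data))   # 7 (bytes: pi takes two bytes in UTF-8)
print(data)        # b'pi = \xcf\x80'

# Decoding with the same encoding recovers the original string.
print(data.decode('utf-8') == s)   # True
```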

UTF-8 is by far the commonest such encoding in use. It has several nice characteristics: for one, the ASCII code points _are_ stored in a single byte. While that's nice for Western almost-only-speaking-English folks like me, it also means that the zillions of existing ASCII text files don't need to be recoded to work in UTF-8. It has other cool features too.

To be pedantic, Unicode strings are sequences of abstract code points
("characters"). UTF-8 is a particular concrete implementation that is
used to store or transmit such code strings. Here are examples of three
possible encoding forms for the string 'πz':

UTF-16: either two, or four, bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A
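
You can reproduce those three byte sequences yourself. This sketch uses the big-endian codec names ('utf-16-be', 'utf-32-be') so that the output matches the hex shown above and no byte order mark is added; that choice is mine, not part of the original examples.

```python
s = '\u03C0z'

# bytes.hex() renders the encoded bytes as hexadecimal digits.
for codec in ('utf-16-be', 'utf-32-be', 'utf-8'):
    print(codec, s.encode(codec).hex())
# utf-16-be 03c0007a
# utf-32-be 000003c00000007a
# utf-8 cf807a

# A code point above U+FFFF needs four bytes in UTF-16 (a surrogate pair).
print(len('\U0001F600'.encode('utf-16-be')))   # 4
```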

I have not tallied up how many code points are actually assigned to
characters.  Does UTF-8 encoding currently cover all of them?  If yes,
why is there a need for other encodings?  Or by saying:

UTF-8 is variable length. You can leap into the middle of a UTF-8 string and resync (== find the first byte of the next character) thanks to its neat coding design, but you can't "seek" directly to the position of an arbitrarily numbered character (eg go to character 102345). By contrast, UTF-32 is fixed length.
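
Here is a rough sketch of both points. The resync() helper is hypothetical (written for this message, not a library function); it relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so any other byte starts a character.

```python
def resync(data, pos):
    """Return the index of the first byte at or after pos that starts a character."""
    # Continuation bytes look like 10xxxxxx; skip over them.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

s = 'a\u03C0z'

utf8 = s.encode('utf-8')      # bytes 61 cf 80 7a
print(resync(utf8, 2))        # 3 (byte 2 is the tail of pi; 'z' starts at 3)
print(resync(utf8, 1))        # 1 (byte 1 is already the first byte of pi)

# With UTF-32 every character is four bytes, so character n is at offset 4*n.
utf32 = s.encode('utf-32-be')
n = 1
print(utf32[4 * n:4 * n + 4].decode('utf-32-be'))   # π
```

Note that resync() only finds a character boundary near a byte offset; finding the n-th character in UTF-8 still means scanning from the start.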

(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
reversed, e.g. C003 7A00. UTF-8 is not.)

do you mean that some hardware configurations require UTF-16 or UTF-32?

No, different machines order the bytes in a larger word in different orders. "Big endian" machines like SPARCs and M68k etc put the most significant bytes first; little endian machines put the least significant bytes first (eg Intel architecture machines). (Aside: the Alpha was switchable.)

So the "natural" way to write UTF-16 or UTF-32 might be big or little endian, and you need to know what was chosen for a given file.
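
To see the two byte orders side by side, here is a sketch using Python's explicit 'utf-16-be' and 'utf-16-le' codecs. The last part shows one common way of recording the choice, a byte order mark (BOM) at the start of the data; that is an illustration of mine and not the only way to do it.

```python
import codecs

s = '\u03C0z'

# The same two characters, with the bytes of each 16-bit unit swapped.
print(s.encode('utf-16-be').hex())   # 03c0007a
print(s.encode('utf-16-le').hex())   # c0037a00

# UTF-8 is defined byte by byte, so there is nothing to swap.
print(s.encode('utf-8').hex())       # cf807a

# A BOM prefix lets the plain 'utf-16' decoder work out the byte order.
marked = codecs.BOM_UTF16_LE + s.encode('utf-16-le')
print(marked.decode('utf-16') == s)  # True
```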

Cameron Simpson <c...@zip.com.au>
Tutor maillist  -  Tutor@python.org