Re: [Tutor] sort() method and non-ASCII

2017-02-06 Thread Alan Gauld via Tutor
On 06/02/17 12:13, boB Stepp wrote:
>>>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>>>> 'pi = π'
>>>
>>>
>>> You have surprised me here by using single quotes to enclose the
>>> entire assignment statements.  I thought this would throw a syntax
>>> error, but it works just like you show.  What is going on here?

> I just came out of the shower this morning thinking, "Stupid boB,
> stupid.  That's just an escape sequence inside an overall string, not
> an assignment statement."  Duh!  My brain works better asleep than
> awake...

To be fair it did look like an assignment so when I first
saw it I too went "huh?!". But then I looked again and
figured it out. :-)

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] sort() method and non-ASCII

2017-02-06 Thread boB Stepp
On Sun, Feb 5, 2017 at 10:49 PM, Cameron Simpson  wrote:
> On 05Feb2017 22:27, boB Stepp  wrote:
>>
>> On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano 
>> wrote:
>>>
>>> Alternatively, you can embed it right in the string. For code points
>>> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
>>> escapes:
>>>
>>> py> 'pi = \u03C0'  # requires exactly four hex digits
>>> 'pi = π'
>>>
>>> py> 'pi = \U000003C0'  # requires exactly eight hex digits
>>> 'pi = π'
>>>
>>>
>>> Lastly, you can use the code point's name:
>>>
>>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>>> 'pi = π'
>>
>>
>> You have surprised me here by using single quotes to enclose the
>> entire assignment statements.  I thought this would throw a syntax
>> error, but it works just like you show.  What is going on here?
>
>
> It's not an assignment statement. It's just a string. He's typing a string
> containing a \N{...} sequence and Python's printing that string back at you;
> pi's a printable character and gets displayed directly.

I just came out of the shower this morning thinking, "Stupid boB,
stupid.  That's just an escape sequence inside an overall string, not
an assignment statement."  Duh!  My brain works better asleep than
awake...

boB


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Cameron Simpson

On 05Feb2017 22:27, boB Stepp wrote:
> On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano wrote:
>> Alternatively, you can embed it right in the string. For code points
>> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
>> escapes:
>>
>> py> 'pi = \u03C0'  # requires exactly four hex digits
>> 'pi = π'
>>
>> py> 'pi = \U000003C0'  # requires exactly eight hex digits
>> 'pi = π'
>>
>> Lastly, you can use the code point's name:
>>
>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>> 'pi = π'
>
> You have surprised me here by using single quotes to enclose the
> entire assignment statements.  I thought this would throw a syntax
> error, but it works just like you show.  What is going on here?


It's not an assignment statement. It's just a string. He's typing a string 
containing a \N{...} sequence and Python's printing that string back at you; 
pi's a printable character and gets displayed directly.


Try with this:

 py> 'here is a string\n\nline 3'


>> One last comment: Random832 said:
>> "Python 3 strings are unicode-unicode, not UTF-8."
>
> If I recall what I originally wrote (and intended) I was merely
> indicating I was happy with Python 3's default UTF-8 encoding.  I do
> not know enough to know what these other UTF encodings offer.


From the outside (i.e. to your code) Python 3 strings are sequences of Unicode 
code points (characters, near enough). How they're _stored_ internally is not 
your problem:-) When you write a string to a file or the terminal etc, the 
string needs to be _encoded_ into a sequence of bytes (a sequence of bytes 
because there are more Unicode code points than can be expressed with one 
byte).


UTF-8 is by far the commonest such encoding in use. It has several nice 
characteristics: for one, the ASCII code points _are_ stored in a single byte.  
While that's nice for Western almost-only-speaking-English folks like me, it 
also means that the zillions of existing ASCII text files don't need to be 
recoded to work in UTF-8. It has other cool features too.
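
For anyone who wants to try the encode step at the prompt, here is a minimal sketch using only the built-in str.encode() and bytes.decode(). The sample strings are made up for illustration.

```python
# Encoding turns a str (code points) into bytes; decoding reverses it.
text = "pi = π"
data = text.encode("utf-8")

# Pure ASCII text encodes to the very same bytes in ASCII and in UTF-8.
ascii_text = "plain old text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(data)                          # b'pi = \xcf\x80'
print(len(text), len(data))          # 6 7  (six code points, seven bytes)
print(data.decode("utf-8") == text)  # True
print(same_bytes)                    # True
```

The only character that takes more than one byte here is π, which is why the byte count is one more than the character count.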



>> To be pedantic, Unicode strings are sequences of abstract code points
>> ("characters"). UTF-8 is a particular concrete implementation that is
>> used to store or transmit such code strings. Here are examples of three
>> possible encoding forms for the string 'πz':
>>
>> UTF-16: either two, or four, bytes per character: 03C0 007A
>>
>> UTF-32: exactly four bytes per character: 000003C0 0000007A
>>
>> UTF-8: between one and four bytes per character: CF80 7A
>
> I have not tallied up how many code points are actually assigned to
> characters.  Does UTF-8 encoding currently cover all of them?  If yes,
> why is there a need for other encodings?  Or by saying:


UTF-8 is variable length. You can leap into the middle of a UTF-8 string and 
resync (== find the first byte of the next character) thanks to its neat coding 
design, but you can't "seek" directly to the position of an arbitrarily 
numbered character (eg go to character 102345). By contrast, UTF-32 is fixed 
length.
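
A rough sketch of the resync idea, assuming the input is valid UTF-8. The helper name next_char_start is hypothetical, not something from the standard library. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos.

    Continuation bytes look like 0b10xxxxxx, so skipping them lands on
    the lead byte of the next character (or on the end of the data).
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "aπz€".encode("utf-8")        # 1 + 2 + 1 + 3 = 7 bytes
start = next_char_start(data, 2)     # index 2 is the second byte of 'π'
print(start)                         # 3
print(data[start:].decode("utf-8"))  # z€

# With a fixed-width encoding, character n simply starts at byte 4 * n.
utf32 = "aπz€".encode("utf-32-be")
n = 3
print(utf32[4 * n:4 * n + 4].decode("utf-32-be"))  # €
```

Resyncing finds the next boundary from wherever you landed, but it cannot tell you which character number that is without counting from the start.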



>> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
>> reversed, e.g. C003 7A00. UTF-8 is not.)
>
> do you mean that some hardware configurations require UTF-16 or UTF-32?


No, different machines order the bytes in a larger word in different orders.  
"Big endian" machines like SPARCs and M68k etc put the most significant bytes 
first; little endian machines put the least significant bytes first (eg Intel 
architecture machines). (Aside: the Alpha was switchable.)


So the "natural" way to write UTF-16 or UTF-32 might be big or little endian, 
and you need to know what was chosen for a given file.
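
To see the two byte orders side by side, here is a small sketch using Python's explicit "utf-16-be" and "utf-16-le" codecs. It also shows one common way a file records which order was chosen: a byte order mark (BOM) at the start, which the generic "utf-16" decoder reads.

```python
import codecs

text = "πz"
big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

print(big.hex())     # 03c0007a
print(little.hex())  # c0037a00

# The generic "utf-16" decoder uses a leading BOM to pick the byte order.
print((codecs.BOM_UTF16_BE + big).decode("utf-16"))     # πz
print((codecs.BOM_UTF16_LE + little).decode("utf-16"))  # πz
```

UTF-8 has no such problem because it is defined as a sequence of single bytes, so there is nothing to reorder.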


Cheers,
Cameron Simpson 


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp
On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano  wrote:
> On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
>> On Sat, Feb 4, 2017 at 10:50 PM, Random832  wrote:
>> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:


> Alternatively, you can embed it right in the string. For code points
> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
> escapes:
>
> py> 'pi = \u03C0'  # requires exactly four hex digits
> 'pi = π'
>
> py> 'pi = \U000003C0'  # requires exactly eight hex digits
> 'pi = π'
>
>
> Lastly, you can use the code point's name:
>
> py> 'pi = \N{GREEK SMALL LETTER PI}'
> 'pi = π'

You have surprised me here by using single quotes to enclose the
entire assignment statements.  I thought this would throw a syntax
error, but it works just like you show.  What is going on here?

>
> One last comment: Random832 said:
>
> "Python 3 strings are unicode-unicode, not UTF-8."

If I recall what I originally wrote (and intended) I was merely
indicating I was happy with Python 3's default UTF-8 encoding.  I do
not know enough to know what these other UTF encodings offer.

> To be pedantic, Unicode strings are sequences of abstract code points
> ("characters"). UTF-8 is a particular concrete implementation that is
> used to store or transmit such code strings. Here are examples of three
> possible encoding forms for the string 'πz':
>
> UTF-16: either two, or four, bytes per character: 03C0 007A
>
> UTF-32: exactly four bytes per character: 000003C0 0000007A
>
> UTF-8: between one and four bytes per character: CF80 7A

I have not tallied up how many code points are actually assigned to
characters.  Does UTF-8 encoding currently cover all of them?  If yes,
why is there a need for other encodings?  Or by saying:

> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
> reversed, e.g. C003 7A00. UTF-8 is not.)

do you mean that some hardware configurations require UTF-16 or UTF-32?

Thank you (and the others in this thread) for taking the time to
clarify these matters.

-- 
boB


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp
On Sun, Feb 5, 2017 at 5:25 PM, Cameron Simpson  wrote:

> You might want to drop this term "hexadecimal"; they're just ordinals (plain
> old numbers). Though Unicode ordinals are often _written_ in hexadecimal for
> compactness and because various character groupings are aligned on ranges
> based on power-of-2 multiples. Like ASCII has the upper case latin alphabet
> at 64 (2^6) and lower case at 96 (2^6 + 2^5). Those values look rounder in
> base 16: 0x40 and 0x60.

I will endeavor to use "code points" instead.  I am just used to
seeing these charts/tables in hexadecimal values.



-- 
boB


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Steven D'Aprano
On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832  wrote:
> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> >> Does the list sort() method (and other sort methods in Python) just go
> >> by the hex value assigned to each symbol to determine sort order in
> >> whichever Unicode encoding chart is being implemented?
> >
> > By default. You need key=locale.strxfrm to make it do anything more
> > sophisticated.
> >
> > I'm not sure what you mean by "whichever unicode encoding chart". Python
> > 3 strings are unicode-unicode, not UTF-8.
> 
> As I said in my response to Steve just now:  I was looking at
> http://unicode.org/charts/  Because they called them charts, so did I.

Ah, that makes sense! They're just reference tables ("charts") for the 
convenience of people wishing to find particular characters.


> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

Correct, although strictly speaking the codes are only conventionally 
given in hexadecimal. They are numbered from U+0000 to U+10FFFF in 
hexadecimal, or 0 to 1114111 in decimal (although not all codes are 
currently used).

The terminology used is that a "code point" is what I've been calling a 
"character", although not all code points are characters. Code points 
are usually written either as the character itself, e.g. 'A', or using 
the notation U+0041 where there are at least four and no more than six 
hexadecimal digits following the "U+". 
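
As an illustration of the U+ notation, here is a small sketch built on ord(), the built-in that returns a character's code point as a number. The helper name code_point_label is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character in the conventional U+XXXX notation."""
    # At least four hex digits, more only when the number needs them.
    return f"U+{ord(ch):04X}"


for ch in ("A", "π", "\U0001F600"):
    print(ord(ch), code_point_label(ch))
# 65 U+0041
# 960 U+03C0
# 128512 U+1F600
```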

Bringing this back to Python, if you know the code point (as a number), 
you can use the chr() function to return it as a string:

py> chr(960)
'π'


Don't forget that Python understands hex too!

py> chr(0x03C0)  # better than chr(int('03C0', 16))
'π'


Alternatively, you can embed it right in the string. For code points 
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U 
escapes:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'


Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'


One last comment: Random832 said:

"Python 3 strings are unicode-unicode, not UTF-8."

To be pedantic, Unicode strings are sequences of abstract code points 
("characters"). UTF-8 is a particular concrete implementation that is 
used to store or transmit such code strings. Here are examples of three 
possible encoding forms for the string 'πz':

UTF-16: either two, or four, bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A

(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be 
reversed, e.g. C003 7A00. UTF-8 is not.)
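
The three encoding forms can be checked at the prompt. This is just a sketch using the big-endian codec names so the output does not depend on the machine. The smiley is an extra example, not from the text above, picked because it lies above U+FFFF.

```python
text = "πz"
encoded = {name: text.encode(name).hex()
           for name in ("utf-16-be", "utf-32-be", "utf-8")}
for name, hexed in encoded.items():
    print(name, hexed)
# utf-16-be 03c0007a
# utf-32-be 000003c00000007a
# utf-8 cf807a

# A code point above U+FFFF needs two 16-bit units in UTF-16.
smiley = "\U0001F600"
print(len(smiley.encode("utf-16-be")))  # 4
print(len(smiley.encode("utf-32-be")))  # 4
print(len(smiley.encode("utf-8")))      # 4
```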

Prior to version 3.3, there was a build-time option to select either 
"narrow" or "wide" Unicode strings. A narrow build used a fixed two 
bytes per code point, together with an incomplete and not quite correct 
scheme for using two code points together to represent the supplementary 
Unicode characters U+10000 through U+10FFFF. (This is sometimes called 
UCS-2, sometimes UTF-16, but strictly speaking it is neither, or at 
least an incomplete and "buggy" implementation of UTF-16.)


-- 
Steve


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread eryk sun
On Sun, Feb 5, 2017 at 10:30 PM, boB Stepp  wrote:
> I was looking at http://unicode.org/charts/  Because they called them
> charts, so did I.  I'm assuming that despite this organization into
> charts, each and every character in each chart has its own unique
> hexadecimal code to designate each character.

Those are PDF charts (i.e. tables) for Unicode blocks:

https://en.wikipedia.org/wiki/Unicode_block

A Unicode block always has a multiple of 16 codepoints, so it's
convenient to represent the ordinal values in hexadecimal.


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Cameron Simpson

On 05Feb2017 16:31, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
>> On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
>>> Does the list sort() method (and other sort methods in Python) just go
>>> by the hex value assigned to each symbol to determine sort order in
>>> whichever Unicode encoding chart is being implemented?
>>
>> By default. You need key=locale.strxfrm to make it do anything more
>> sophisticated.
>>
>> I'm not sure what you mean by "whichever unicode encoding chart". Python
>> 3 strings are unicode-unicode, not UTF-8.
>
> As I said in my response to Steve just now:  I was looking at
> http://unicode.org/charts/  Because they called them charts, so did I.
> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.


You might want to drop this term "hexadecimal"; they're just ordinals (plain 
old numbers). Though Unicode ordinals are often _written_ in hexadecimal for 
compactness and because various character groupings are aligned on ranges based 
on power-of-2 multiples. Like ASCII has the upper case latin alphabet at 64 
(2^6) and lower case at 96 (2^6 + 2^5). Those values look rounder in base 16: 
0x40 and 0x60.


Cheers,
Cameron Simpson 


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp
On Sat, Feb 4, 2017 at 10:50 PM, Random832  wrote:
> On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> By default. You need key=locale.strxfrm to make it do anything more
> sophisticated.
>
> I'm not sure what you mean by "whichever unicode encoding chart". Python
> 3 strings are unicode-unicode, not UTF-8.

As I said in my response to Steve just now:  I was looking at
http://unicode.org/charts/  Because they called them charts, so did I.
I'm assuming that despite this organization into charts, each and
every character in each chart has its own unique hexadecimal code to
designate each character.


-- 
boB


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp
On Sun, Feb 5, 2017 at 2:32 AM, Steven D'Aprano  wrote:
> On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> Correct, except that there is only one Unicode encoding chart.

I was looking at http://unicode.org/charts/  Because they called them
charts, so did I.  I'm assuming that despite this organization into
charts, each and every character in each chart has its own unique
hexadecimal code to designate each character.



-- 
boB


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Random832
On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

By default. You need key=locale.strxfrm to make it do anything more
sophisticated.

I'm not sure what you mean by "whichever unicode encoding chart". Python
3 strings are unicode-unicode, not UTF-8.
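
A hedged sketch of the key=locale.strxfrm suggestion. The wrapper name locale_sorted is hypothetical, and the result depends entirely on which locales are installed on your system, so no particular locale-aware order is claimed here.

```python
import locale


def locale_sorted(words, locale_name=""):
    """Sort words using the collation rules of the given locale.

    locale_name="" asks for the user's default locale. If that locale is
    unavailable, the collation setting already in effect is kept.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        pass  # fall back to whatever collation is already active
    return sorted(words, key=locale.strxfrm)


words = ["zebra", "áé", "apple"]
print(sorted(words))         # code point order: ['apple', 'zebra', 'áé']
print(locale_sorted(words))  # order depends on the active locale
```

Note that locale.setlocale() changes a process-wide setting, so in a larger program you would normally call it once at startup rather than inside a helper like this.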


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Steven D'Aprano
On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

Correct, except that there is only one Unicode encoding chart.

You may be thinking of the legacy Windows "code pages" system, where you 
can change the code page to re-interpret characters as different 
characters. E.g. ð in code page 1252 (Western European) becomes π in 
code page 1253 (Greek).

Python supports encoding and decoding to and from legacy code page 
forms, but Unicode itself does away with the idea of using separate code 
pages. It effectively is a single, giant code page containing room for 
over a million characters. It's also a superset of ASCII, so pure ASCII 
text can be identical in Unicode.
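
To make the code page point concrete, here is a short sketch using Python's built-in "cp1252" and "cp1253" codecs. The same single byte decodes to two different characters, while in Unicode each of those characters has its own code point.

```python
raw = b"\xf0"  # one byte; its meaning depends on the code page
western = raw.decode("cp1252")
greek = raw.decode("cp1253")
print(western, greek)  # ð π

# In Unicode each character has its own number, whatever the byte was.
print(hex(ord(western)), hex(ord(greek)))  # 0xf0 0x3c0

# ASCII characters keep their ASCII numbers in Unicode.
print([ord(c) for c in "abc"])  # [97, 98, 99]
```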

Anyhoo, since Unicode supports dozens of languages from all over the 
world, it defines "collation rules" for sorting text in various 
languages. For example, sorting in Austria is different from sorting in 
Germany, despite them both using the same alphabet. Even in English, 
sorting rules can vary: some phone books sort Mc and Mac together, some 
don't.

However, Python doesn't directly support that. It just provides a single 
basic lexicographic sort based on the ord() of each character in the 
string.

> If yes, then
> my expectation would be that the French "á" would come after the "z"
> character. 

Correct:

py> "á" > "z"
True
py> sorted('áz')
['z', 'á']



-- 
Steve



Re: [Tutor] sort() method and non-ASCII

2017-02-04 Thread eryk sun
On Sun, Feb 5, 2017 at 3:52 AM, boB Stepp  wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

list.sort uses a less-than comparison. What you really want to know is
how Python compares strings. They're compared by ordinal at
corresponding indexes, i.e. ord(s1[i]) vs ord(s2[i]) for i less than
min(len(s1), len(s2)).
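
Spelled out in Python, the comparison looks roughly like the sketch below. The function str_less_than is a hypothetical stand-in written for illustration, not how CPython implements it. It also covers the case where one string is a prefix of the other, in which the shorter string sorts first.

```python
def str_less_than(s1: str, s2: str) -> bool:
    """Mimic s1 < s2 by comparing code points position by position."""
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            return ord(c1) < ord(c2)
    # One string is a prefix of the other: the shorter one sorts first.
    return len(s1) < len(s2)


pairs = [("z", "á"), ("Zebra", "apple"), ("abc", "abcd"), ("same", "same")]
for a, b in pairs:
    # The two booleans on each line should always agree.
    print(a, b, str_less_than(a, b), a < b)
```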

This gets a bit interesting when you're comparing characters that have
composed and decomposed Unicode forms, i.e. a single code vs multiple
combining codes. For example:

>>> s1 = '\xc7'
>>> s2 = 'C' + '\u0327'
>>> print(s1, s2)
Ç Ç
>>> s2 < s1
True

where U+0327 is a combining cedilla. As characters, s1 and s2 are the
same. However, codewise s2 is less than s1 because 0x43 ("C") is less
than 0xc7 ("Ç"). In this case you can first normalize the strings to
either composed or decomposed form [1]. For example:

>>> strings = ['\xc7', 'C\u0327', 'D']
>>> sorted(strings)
['Ç', 'D', 'Ç']

>>> import functools, unicodedata
>>> norm_nfc = functools.partial(unicodedata.normalize, 'NFC')
>>> sorted(strings, key=norm_nfc)
['D', 'Ç', 'Ç']

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
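
As a follow-up sketch, the same normalization also matters for equality tests, not just sorting. The variable names here are only illustrative.

```python
import unicodedata

composed = "\xc7"       # single code point U+00C7
decomposed = "C\u0327"  # 'C' followed by a combining cedilla

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(len(composed), len(decomposed))                        # 1 2
```

Normalizing both sides to the same form before comparing is what makes the two spellings of Ç compare equal.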
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


[Tutor] sort() method and non-ASCII

2017-02-04 Thread boB Stepp
Does the list sort() method (and other sort methods in Python) just go
by the hex value assigned to each symbol to determine sort order in
whichever Unicode encoding chart is being implemented?  If yes, then
my expectation would be that the French "á" would come after the "z"
character.  I am not ready to get into the guts of Unicode.  I am
quite happy now to leave Python 3 at its default UTF-8 and strictly
type in the ASCII subset of UTF-8.  But I know I will eventually have
to get into this, so I thought I would ask about sorting so I don't
get any evil surprises with some text file I might have to manipulate
in the future.

Thanks!

-- 
boB