Re: [Tutor] sort() method and non-ASCII
On 06/02/17 12:13, boB Stepp wrote:

>>>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>>>> 'pi = π'
>>>
>>> You have surprised me here by using single quotes to enclose the
>>> entire assignment statements. I thought this would throw a syntax
>>> error, but it works just like you show. What is going on here?
>
> I just came out of the shower this morning thinking, "Stupid boB,
> stupid. That's just an escape sequence inside an overall string, not
> an assignment statement." Duh! My brain works better asleep than
> awake...

To be fair it did look like an assignment so when I first saw it I too
went "huh?!". But then I looked again and figured it out. :-)

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 10:49 PM, Cameron Simpson wrote:
> On 05Feb2017 22:27, boB Stepp wrote:
>>
>> On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano wrote:
>>>
>>> Alternatively, you can embed it right in the string. For code points
>>> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
>>> escapes:
>>>
>>> py> 'pi = \u03C0' # requires exactly four hex digits
>>> 'pi = π'
>>>
>>> py> 'pi = \U000003C0' # requires exactly eight hex digits
>>> 'pi = π'
>>>
>>> Lastly, you can use the code point's name:
>>>
>>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>>> 'pi = π'
>>
>> You have surprised me here by using single quotes to enclose the
>> entire assignment statements. I thought this would throw a syntax
>> error, but it works just like you show. What is going on here?
>
> It's not an assignment statement. It's just a string. He's typing a string
> containing a \N{...} sequence and Python's printing that string back at you;
> pi's a printable character and gets displayed directly.

I just came out of the shower this morning thinking, "Stupid boB,
stupid. That's just an escape sequence inside an overall string, not
an assignment statement." Duh! My brain works better asleep than
awake...

boB
Re: [Tutor] sort() method and non-ASCII
On 05Feb2017 22:27, boB Stepp wrote:
> On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano wrote:
>> Alternatively, you can embed it right in the string. For code points
>> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
>> escapes:
>>
>> py> 'pi = \u03C0' # requires exactly four hex digits
>> 'pi = π'
>>
>> py> 'pi = \U000003C0' # requires exactly eight hex digits
>> 'pi = π'
>>
>> Lastly, you can use the code point's name:
>>
>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>> 'pi = π'
>
> You have surprised me here by using single quotes to enclose the
> entire assignment statements. I thought this would throw a syntax
> error, but it works just like you show. What is going on here?

It's not an assignment statement. It's just a string. He's typing a string
containing a \N{...} sequence and Python's printing that string back at you;
pi's a printable character and gets displayed directly.

Try with this:

    py> 'here is a string\n\nline 3'

>> One last comment: Random832 said:
>>
>> "Python 3 strings are unicode-unicode, not UTF-8."
>
> If I recall what I originally wrote (and intended) I was merely
> indicating I was happy with Python 3's default UTF-8 encoding. I do
> not know enough to know what these other UTF encodings offer.

From the outside (i.e. to your code) Python 3 strings are sequences of
Unicode code points (characters, near enough). How they're _stored_
internally is not your problem:-)

When you write a string to a file or the terminal etc, the string needs to
be _encoded_ into a sequence of bytes (a sequence of bytes because there
are more Unicode code points than can be expressed with one byte).

UTF-8 is by far the commonest such encoding in use. It has several nice
characteristics: for one, the ASCII code points _are_ stored in a single
byte. While that's nice for Western almost-only-speaking-English folks like
me, it also means that the zillions of existing ASCII text files don't need
to be recoded to work in UTF-8. It has other cool features too.

>> To be pedantic, Unicode strings are sequences of abstract code points
>> ("characters"). UTF-8 is a particular concrete implementation that is
>> used to store or transmit such code strings. Here are examples of three
>> possible encoding forms for the string 'πz':
>>
>> UTF-16: either two, or four, bytes per character: 03C0 007A
>>
>> UTF-32: exactly four bytes per character: 000003C0 0000007A
>>
>> UTF-8: between one and four bytes per character: CF80 7A
>
> I have not tallied up how many code points are actually assigned to
> characters. Does UTF-8 encoding currently cover all of them? If yes,
> why is there a need for other encodings? Or by saying:

UTF-8 is variable length. You can leap into the middle of a UTF-8 string
and resync (== find the first byte of the next character) thanks to its
neat coding design, but you can't "seek" directly to the position of an
arbitrarily numbered character (eg go to character 102345). By contrast,
UTF-32 is fixed length.

>> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
>> reversed, e.g. C003 7A00. UTF-8 is not.)
>
> do you mean that some hardware configurations require UTF-16 or UTF-32?

No, different machines order the bytes in a larger word in different
orders. "Big endian" machines like SPARCs and M68k etc put the most
significant bytes first; little endian machines put the least significant
bytes first (eg Intel architecture machines). (Aside: the Alpha was
switchable.)

So the "natural" way to write UTF-16 or UTF-32 might be big or little
endian, and you need to know what was chosen for a given file.

Cheers,
Cameron Simpson
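[Editor's sketch] The resync and fixed-width points above can be tried out directly. This is only an illustration under stated assumptions: the helper name next_char_start is made up for this note, and the sample string 'aπz' is arbitrary. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is the start of a character.

    Hypothetical helper for illustration: UTF-8 continuation bytes
    always look like 0b10xxxxxx, so skipping them resynchronises.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


text = "aπz"
utf8 = text.encode("utf-8")       # bytes 61 CF 80 7A
utf32 = text.encode("utf-32-be")  # four bytes per code point, no BOM

# Index 2 is the second byte of π; resync lands on the 'z' at index 3.
start = next_char_start(utf8, 2)
print(utf8[start:].decode("utf-8"))  # z

# With a fixed-width encoding, character n lives at byte offset 4 * n.
n = 1
print(utf32[4 * n : 4 * n + 4].decode("utf-32-be"))  # π

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")
```

The explicit "utf-32-be" codec is used so that no byte-order mark is written and the offsets stay simple; plain "utf-32" would prepend one.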
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano wrote:
> On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
>> On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
>> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:

> Alternatively, you can embed it right in the string. For code points
> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
> escapes:
>
> py> 'pi = \u03C0' # requires exactly four hex digits
> 'pi = π'
>
> py> 'pi = \U000003C0' # requires exactly eight hex digits
> 'pi = π'
>
> Lastly, you can use the code point's name:
>
> py> 'pi = \N{GREEK SMALL LETTER PI}'
> 'pi = π'

You have surprised me here by using single quotes to enclose the
entire assignment statements. I thought this would throw a syntax
error, but it works just like you show. What is going on here?

> One last comment: Random832 said:
>
> "Python 3 strings are unicode-unicode, not UTF-8."

If I recall what I originally wrote (and intended) I was merely
indicating I was happy with Python 3's default UTF-8 encoding. I do
not know enough to know what these other UTF encodings offer.

> To be pedantic, Unicode strings are sequences of abstract code points
> ("characters"). UTF-8 is a particular concrete implementation that is
> used to store or transmit such code strings. Here are examples of three
> possible encoding forms for the string 'πz':
>
> UTF-16: either two, or four, bytes per character: 03C0 007A
>
> UTF-32: exactly four bytes per character: 000003C0 0000007A
>
> UTF-8: between one and four bytes per character: CF80 7A

I have not tallied up how many code points are actually assigned to
characters. Does UTF-8 encoding currently cover all of them? If yes,
why is there a need for other encodings? Or by saying:

> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
> reversed, e.g. C003 7A00. UTF-8 is not.)

do you mean that some hardware configurations require UTF-16 or UTF-32?

Thank you (and the others in this thread) for taking the time to
clarify these matters.

--
boB
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 5:25 PM, Cameron Simpson wrote:

> You might want to drop this term "hexadecimal"; they're just ordinals (plain
> old numbers). Though Unicode ordinals are often _written_ in hexadecimal for
> compactness and because various character groupings are aligned on ranges
> based on power-of-2 multiples. Like ASCII has the upper case latin alphabet
> at 64 (2^6) and lower case at 96 (2^6 + 2^5). Those values look rounder in
> base 16: 0x40 and 0x60.

I will endeavor to use "code points" instead. I am just used to seeing
these charts/tables in hexadecimal values.

--
boB
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> >> Does the list sort() method (and other sort methods in Python) just go
> >> by the hex value assigned to each symbol to determine sort order in
> >> whichever Unicode encoding chart is being implemented?
> >
> > By default. You need key=locale.strxfrm to make it do anything more
> > sophisticated.
> >
> > I'm not sure what you mean by "whichever unicode encoding chart". Python
> > 3 strings are unicode-unicode, not UTF-8.
>
> As I said in my response to Steve just now: I was looking at
> http://unicode.org/charts/ Because they called them charts, so did I.

Ah, that makes sense! They're just reference tables ("charts") for the
convenience of people wishing to find particular characters.

> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

Correct, although strictly speaking the codes are only conventionally
given in hexadecimal. They are numbered from 0 to 1114111 in decimal
(although not all codes are currently used).

The terminology used is that a "code point" is what I've been calling a
"character", although not all code points are characters. Code points
are usually written either as the character itself, e.g. 'A', or using
the notation U+0041 where there are at least four and no more than six
hexadecimal digits following the "U+".

Bringing this back to Python, if you know the code point (as a number),
you can use the chr() function to return it as a string:

py> chr(960)
'π'

Don't forget that Python understands hex too!

py> chr(0x03C0) # better than chr(int('03C0', 16))
'π'

Alternatively, you can embed it right in the string. For code points
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
escapes:

py> 'pi = \u03C0' # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0' # requires exactly eight hex digits
'pi = π'

Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'

One last comment: Random832 said:

"Python 3 strings are unicode-unicode, not UTF-8."

To be pedantic, Unicode strings are sequences of abstract code points
("characters"). UTF-8 is a particular concrete implementation that is
used to store or transmit such code strings. Here are examples of three
possible encoding forms for the string 'πz':

UTF-16: either two, or four, bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A

(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
reversed, e.g. C003 7A00. UTF-8 is not.)

Prior to version 3.3, there was a build-time option to select either
"narrow" or "wide" Unicode strings. A narrow build used a fixed two
bytes per code point, together with an incomplete and not quite correct
scheme for using two code points together to represent the supplementary
Unicode characters U+10000 through U+10FFFF. (This is sometimes called
UCS-2, sometimes UTF-16, but strictly speaking it is neither, or at
least an incomplete and "buggy" implementation of UTF-16.)

--
Steve
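[Editor's sketch] The byte sequences quoted for 'πz' can be reproduced with str.encode(). This is a small illustration, not part of the original message; the helper name hex_bytes is invented for this note, and the explicit "-be"/"-le" codec names are chosen so that no byte-order mark is added.

```python
import sys

PI = "\N{GREEK SMALL LETTER PI}"

# The spellings discussed above all produce the same one-character string.
assert chr(960) == chr(0x03C0) == "\u03C0" == "\U000003C0" == PI

text = PI + "z"


def hex_bytes(s: str, encoding: str) -> str:
    """Encode s and return the resulting bytes as upper-case hex."""
    return s.encode(encoding).hex().upper()


print(hex_bytes(text, "utf-8"))      # CF807A
print(hex_bytes(text, "utf-16-be"))  # 03C0007A
print(hex_bytes(text, "utf-16-le"))  # C0037A00
print(hex_bytes(text, "utf-32-be"))  # 000003C00000007A

# Which order plain "utf-16"/"utf-32" use depends on the machine:
print(sys.byteorder)  # 'little' or 'big'
```

Using plain "utf-16" or "utf-32" instead would prepend a byte-order mark and write the data in the machine's native order, which is the hardware dependence mentioned above.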
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 10:30 PM, boB Stepp wrote:
> I was looking at http://unicode.org/charts/ Because they called them
> charts, so did I. I'm assuming that despite this organization into
> charts, each and every character in each chart has its own unique
> hexadecimal code to designate each character.

Those are PDF charts (i.e. tables) for Unicode blocks:

https://en.wikipedia.org/wiki/Unicode_block

A Unicode block always has a multiple of 16 codepoints, so it's
convenient to represent the ordinal values in hexadecimal.
Re: [Tutor] sort() method and non-ASCII
On 05Feb2017 16:31, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
>> On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
>>> Does the list sort() method (and other sort methods in Python) just go
>>> by the hex value assigned to each symbol to determine sort order in
>>> whichever Unicode encoding chart is being implemented?
>>
>> By default. You need key=locale.strxfrm to make it do anything more
>> sophisticated.
>>
>> I'm not sure what you mean by "whichever unicode encoding chart". Python
>> 3 strings are unicode-unicode, not UTF-8.
>
> As I said in my response to Steve just now: I was looking at
> http://unicode.org/charts/ Because they called them charts, so did I.
> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

You might want to drop this term "hexadecimal"; they're just ordinals
(plain old numbers). Though Unicode ordinals are often _written_ in
hexadecimal for compactness and because various character groupings are
aligned on ranges based on power-of-2 multiples. Like ASCII has the upper
case latin alphabet at 64 (2^6) and lower case at 96 (2^6 + 2^5). Those
values look rounder in base 16: 0x40 and 0x60.

Cheers,
Cameron Simpson
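[Editor's sketch] A quick way to see that code points are plain integers that merely look rounder in hexadecimal. The helper name u_plus is made up for this note; it formats an ordinal in the usual U+XXXX notation with at least four hex digits.

```python
def u_plus(ch: str) -> str:
    """Format a character's ordinal in U+XXXX notation (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


# '@' and '`' sit at the starts of the ranges mentioned above; the
# letters 'A' and 'a' follow immediately after them.
for ch in "@A`aπ":
    print(ch, ord(ch), hex(ord(ch)), u_plus(ch))

assert ord("@") == 2**6 == 0x40
assert ord("`") == 2**6 + 2**5 == 0x60
```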
Re: [Tutor] sort() method and non-ASCII
On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
> On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> By default. You need key=locale.strxfrm to make it do anything more
> sophisticated.
>
> I'm not sure what you mean by "whichever unicode encoding chart". Python
> 3 strings are unicode-unicode, not UTF-8.

As I said in my response to Steve just now: I was looking at
http://unicode.org/charts/ Because they called them charts, so did I.
I'm assuming that despite this organization into charts, each and
every character in each chart has its own unique hexadecimal code to
designate each character.

--
boB
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 2:32 AM, Steven D'Aprano wrote:
> On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> Correct, except that there is only one Unicode encoding chart.

I was looking at http://unicode.org/charts/ Because they called them
charts, so did I. I'm assuming that despite this organization into
charts, each and every character in each chart has its own unique
hexadecimal code to designate each character.

--
boB
Re: [Tutor] sort() method and non-ASCII
On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

By default. You need key=locale.strxfrm to make it do anything more
sophisticated.

I'm not sure what you mean by "whichever unicode encoding chart". Python
3 strings are unicode-unicode, not UTF-8.
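[Editor's sketch] One way key=locale.strxfrm might be used. The wrapper name locale_sorted and the fallback to plain sorting are assumptions made for this note, not something from the thread. The resulting order depends entirely on which locales are installed and selected on the machine, so no expected output is shown.

```python
import locale


def locale_sorted(words, locale_name=""):
    """Sort words with the collation rules of locale_name.

    Hypothetical wrapper: "" means the user's default locale. If the
    locale cannot be set or used, fall back to code-point order.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    except (locale.Error, OSError):
        return sorted(words)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["z", "á", "a"]
print(sorted(words))         # code-point order: ['a', 'z', 'á']
print(locale_sorted(words))  # order depends on the active locale
```

Restoring the previous LC_COLLATE setting in the finally clause keeps the change from leaking into the rest of the program, since the locale is process-wide state.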
Re: [Tutor] sort() method and non-ASCII
On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

Correct, except that there is only one Unicode encoding chart.

You may be thinking of the legacy Windows "code pages" system, where you
can change the code page to re-interpret characters as different
characters. E.g. ð in code page 1252 (Western European) becomes π in
code page 1253 (Greek).

Python supports encoding and decoding to and from legacy code page
forms, but Unicode itself does away with the idea of using separate code
pages. It effectively is a single, giant code page containing room for
over a million characters. It's also a superset of ASCII, so pure ASCII
text can be identical in Unicode.

Anyhoo, since Unicode supports dozens of languages from all over the
world, it defines "collation rules" for sorting text in various
languages. For example, sorting in Austria is different from sorting in
Germany, despite them both using the same alphabet. Even in English,
sorting rules can vary: some phone books sort Mc and Mac together, some
don't.

However, Python doesn't directly support that. It just provides a single
basic lexicographic sort based on the ord() of each character in the
string.

> If yes, then
> my expectation would be that the French "á" would come after the "z"
> character.

Correct:

py> "á" > "z"
True
py> sorted('áz')
['z', 'á']

--
Steve
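[Editor's sketch] Both points above can be checked in a few lines: the same byte means different characters under different legacy code pages, and the default sort is just a comparison of ord() values. The word list is an arbitrary example chosen for this note.

```python
# One byte, two legacy code pages, two different characters.
raw = bytes([0xF0])
print(raw.decode("cp1252"))  # ð
print(raw.decode("cp1253"))  # π

# The default sort compares code points, so upper case sorts before
# lower case and accented letters sort after 'z'.
words = ["zebra", "álamo", "apple", "Zulu"]
by_code_point = sorted(words)
by_ord_lists = sorted(words, key=lambda w: [ord(ch) for ch in w])

print(by_code_point)  # ['Zulu', 'apple', 'zebra', 'álamo']
assert by_code_point == by_ord_lists
```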
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 3:52 AM, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

list.sort uses a less-than comparison. What you really want to know is
how Python compares strings. They're compared by ordinal at
corresponding indexes, i.e. ord(s1[i]) vs ord(s2[i]) for i less than
min(len(s1), len(s2)).

This gets a bit interesting when you're comparing characters that have
composed and decomposed Unicode forms, i.e. a single code vs multiple
combining codes. For example:

    >>> s1 = '\xc7'
    >>> s2 = 'C' + '\u0327'
    >>> print(s1, s2)
    Ç Ç
    >>> s2 < s1
    True

where U+0327 is a combining cedilla. As characters, s1 and s2 are the
same. However, codewise s2 is less than s1 because 0x43 ("C") is less
than 0xc7 ("Ç"). In this case you can first normalize the strings to
either composed or decomposed form [1]. For example:

    >>> import functools, unicodedata
    >>> strings = ['\xc7', 'C\u0327', 'D']
    >>> sorted(strings)
    ['Ç', 'D', 'Ç']

    >>> norm_nfc = functools.partial(unicodedata.normalize, 'NFC')
    >>> sorted(strings, key=norm_nfc)
    ['D', 'Ç', 'Ç']

[1]: https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
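[Editor's sketch] The comparison rule described above can be checked by comparing the strings' lists of ordinals, and normalization can be seen converting between the two forms. The helper name code_points is invented for this note.

```python
import unicodedata


def code_points(s: str) -> list:
    """Return the ordinals of s as a list (hypothetical helper)."""
    return [ord(ch) for ch in s]


composed = "\xc7"       # Ç as a single code point, U+00C7
decomposed = "C\u0327"  # 'C' followed by U+0327 COMBINING CEDILLA

# String comparison agrees with comparing the lists of ordinals.
assert (decomposed < composed) == (code_points(decomposed) < code_points(composed))

# The two forms display alike but are different strings...
assert composed != decomposed

# ...until they are normalized to a common form.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

print(code_points(composed), code_points(decomposed))  # [199] [67, 807]
```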
[Tutor] sort() method and non-ASCII
Does the list sort() method (and other sort methods in Python) just go
by the hex value assigned to each symbol to determine sort order in
whichever Unicode encoding chart is being implemented? If yes, then
my expectation would be that the French "á" would come after the "z"
character.

I am not ready to get into the guts of Unicode. I am quite happy now
to leave Python 3 at its default UTF-8 and strictly type in the ASCII
subset of UTF-8. But I know I will eventually have to get into this,
so I thought I would ask about sorting so I don't get any evil
surprises with some text file I might have to manipulate in the
future.

Thanks!

--
boB