Re: [Tutor] sort() method and non-ASCII
On 05Feb2017 22:27, boB Stepp wrote:
> On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano wrote:
>> Alternatively, you can embed it right in the string. For code points
>> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
>> escapes:
>>
>> py> 'pi = \u03C0'  # requires exactly four hex digits
>> 'pi = π'
>>
>> py> 'pi = \U000003C0'  # requires exactly eight hex digits
>> 'pi = π'
>>
>> Lastly, you can use the code point's name:
>>
>> py> 'pi = \N{GREEK SMALL LETTER PI}'
>> 'pi = π'
>
> You have surprised me here by using single quotes to enclose the
> entire assignment statements. I thought this would throw a syntax
> error, but it works just like you show. What is going on here?

It's not an assignment statement. It's just a string. He's typing a string containing a \N{...} sequence and Python's printing that string back at you; pi's a printable character and gets displayed directly. Try with this:

py> 'here is a string\n\nline 3'

>> One last comment: Random832 said:
>>
>> "Python 3 strings are unicode-unicode, not UTF-8."
>
> If I recall what I originally wrote (and intended) I was merely
> indicating I was happy with Python 3's default UTF-8 encoding. I do
> not know enough to know what these other UTF encodings offer.

From the outside (i.e. to your code) Python 3 strings are sequences of Unicode code points (characters, near enough). How they're _stored_ internally is not your problem :-)

When you write a string to a file or the terminal etc, the string needs to be _encoded_ into a sequence of bytes (a sequence of bytes because there are more Unicode code points than can be expressed with one byte).

UTF-8 is by far the commonest such encoding in use. It has several nice characteristics: for one, the ASCII code points _are_ stored in a single byte. While that's nice for Western almost-only-speaking-English folks like me, it also means that the zillions of existing ASCII text files don't need to be recoded to work in UTF-8. It has other cool features too.

>> To be pedantic, Unicode strings are sequences of abstract code points
>> ("characters"). UTF-8 is a particular concrete implementation that is
>> used to store or transmit such code strings. Here are examples of three
>> possible encoding forms for the string 'πz':
>>
>> UTF-16: either two, or four, bytes per character: 03C0 007A
>>
>> UTF-32: exactly four bytes per character: 000003C0 0000007A
>>
>> UTF-8: between one and four bytes per character: CF80 7A
>
> I have not tallied up how many code points are actually assigned to
> characters. Does UTF-8 encoding currently cover all of them? If yes,
> why is there a need for other encodings? Or by saying:

UTF-8 is variable length. You can leap into the middle of a UTF-8 string and resync (== find the first byte of the next character) thanks to its neat coding design, but you can't "seek" directly to the position of an arbitrarily numbered character (eg go to character 102345). By contrast, UTF-32 is fixed length.

>> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
>> reversed, e.g. C003 7A00. UTF-8 is not.)
>
> do you mean that some hardware configurations require UTF-16 or UTF-32?

No, different machines order the bytes in a larger word in different orders. "Big endian" machines like SPARCs and M68k etc put the most significant bytes first; little endian machines put the least significant bytes first (eg Intel architecture machines). (Aside: the Alpha was switchable.)

So the "natural" way to write UTF-16 or UTF-32 might be big or little endian, and you need to know what was chosen for a given file.

Cheers,
Cameron Simpson
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
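The byte sequences discussed above can be checked directly with str.encode(). This is only a small sketch: the variable names are made up for illustration, and the explicit "-be"/"-le" codec names are used so that the byte order is stated rather than left to the platform.

```python
# Encode the two-character string 'πz' with several Unicode encodings
# and look at the resulting bytes in hex.
text = "\u03c0z"  # 'πz'

utf8 = text.encode("utf-8")
utf16_be = text.encode("utf-16-be")  # big endian, no byte order mark
utf16_le = text.encode("utf-16-le")  # little endian, no byte order mark
utf32_be = text.encode("utf-32-be")  # big endian, four bytes per code point

for name, data in [("utf-8", utf8), ("utf-16-be", utf16_be),
                   ("utf-16-le", utf16_le), ("utf-32-be", utf32_be)]:
    print(name, data.hex())
# utf-8 cf807a
# utf-16-be 03c0007a
# utf-16-le c0037a00
# utf-32-be 000003c00000007a

# The plain "utf-16" codec prepends a byte order mark (BOM) so that a
# reader can tell which byte order the writer chose.
with_bom = text.encode("utf-16")
print(with_bom[:2].hex())  # 'fffe' on little endian machines, 'feff' on big endian
```

Decoding with the plain "utf-16" codec reads the BOM and picks the matching byte order, which is one common way of recording "what was chosen for a given file".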
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano wrote:
> On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
>> On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
>> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:

> Alternatively, you can embed it right in the string. For code points
> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
> escapes:
>
> py> 'pi = \u03C0'  # requires exactly four hex digits
> 'pi = π'
>
> py> 'pi = \U000003C0'  # requires exactly eight hex digits
> 'pi = π'
>
>
> Lastly, you can use the code point's name:
>
> py> 'pi = \N{GREEK SMALL LETTER PI}'
> 'pi = π'

You have surprised me here by using single quotes to enclose the entire assignment statements. I thought this would throw a syntax error, but it works just like you show. What is going on here?

>
> One last comment: Random832 said:
>
> "Python 3 strings are unicode-unicode, not UTF-8."

If I recall what I originally wrote (and intended) I was merely indicating I was happy with Python 3's default UTF-8 encoding. I do not know enough to know what these other UTF encodings offer.

> To be pedantic, Unicode strings are sequences of abstract code points
> ("characters"). UTF-8 is a particular concrete implementation that is
> used to store or transmit such code strings. Here are examples of three
> possible encoding forms for the string 'πz':
>
> UTF-16: either two, or four, bytes per character: 03C0 007A
>
> UTF-32: exactly four bytes per character: 000003C0 0000007A
>
> UTF-8: between one and four bytes per character: CF80 7A

I have not tallied up how many code points are actually assigned to characters. Does UTF-8 encoding currently cover all of them? If yes, why is there a need for other encodings? Or by saying:

> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
> reversed, e.g. C003 7A00. UTF-8 is not.)

do you mean that some hardware configurations require UTF-16 or UTF-32?

Thank you (and the others in this thread) for taking the time to clarify these matters.

--
boB
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 5:25 PM, Cameron Simpson wrote:
> You might want to drop this term "hexadecimal"; they're just ordinals (plain
> old numbers). Though Unicode ordinals are often _written_ in hexadecimal for
> compactness and because various character groupings are aligned on ranges
> based on power-of-2 multiples. Like ASCII has the upper case latin alphabet
> at 64 (2^6) and lower case at 96 (2^6 + 2^5). Those values look rounder in
> base 16: 0x40 and 0x60.

I will endeavor to use "code points" instead. I am just used to seeing these charts/tables in hexadecimal values.

--
boB
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> >> Does the list sort() method (and other sort methods in Python) just go
> >> by the hex value assigned to each symbol to determine sort order in
> >> whichever Unicode encoding chart is being implemented?
> >
> > By default. You need key=locale.strxfrm to make it do anything more
> > sophisticated.
> >
> > I'm not sure what you mean by "whichever unicode encoding chart". Python
> > 3 strings are unicode-unicode, not UTF-8.
>
> As I said in my response to Steve just now: I was looking at
> http://unicode.org/charts/ Because they called them charts, so did I.

Ah, that makes sense! They're just reference tables ("charts") for the convenience of people wishing to find particular characters.

> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

Correct, although strictly speaking the codes are only conventionally given in hexadecimal. They are numbered from 0 to 1114111 in decimal (although not all codes are currently used).

The terminology used is that a "code point" is what I've been calling a "character", although not all code points are characters. Code points are usually written either as the character itself, e.g. 'A', or using the notation U+0041 where there are at least four and no more than six hexadecimal digits following the "U+".

Bringing this back to Python, if you know the code point (as a number), you can use the chr() function to return it as a string:

py> chr(960)
'π'

Don't forget that Python understands hex too!

py> chr(0x03C0)  # better than chr(int('03C0', 16))
'π'

Alternatively, you can embed it right in the string. For code points between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U escapes:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'

Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'

One last comment: Random832 said:

"Python 3 strings are unicode-unicode, not UTF-8."

To be pedantic, Unicode strings are sequences of abstract code points ("characters"). UTF-8 is a particular concrete implementation that is used to store or transmit such code strings. Here are examples of three possible encoding forms for the string 'πz':

UTF-16: either two, or four, bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A

(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be reversed, e.g. C003 7A00. UTF-8 is not.)

Prior to version 3.3, there was a build-time option to select either "narrow" or "wide" Unicode strings. A narrow build used a fixed two bytes per code point, together with an incomplete and not quite correct scheme for using two code points together to represent the supplementary Unicode characters U+10000 through U+10FFFF. (This is sometimes called UCS-2, sometimes UTF-16, but strictly speaking it is neither, or at least an incomplete and "buggy" implementation of UTF-16.)

--
Steve
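As a quick check of the four spellings described above, the following sketch builds the same one-character string each way and compares the results. The variable names are illustrative only; unicodedata is the standard library module for looking up code point names.

```python
import sys
import unicodedata

pi = chr(0x03C0)  # from the code point as a number

# Four other ways to spell the same one-character string.
spellings = [
    chr(960),                      # decimal ordinal
    "\u03C0",                      # exactly four hex digits
    "\U000003C0",                  # exactly eight hex digits
    "\N{GREEK SMALL LETTER PI}",   # by name
]
all_equal = all(s == pi for s in spellings)
print(all_equal)  # True

# Going the other way: from the character back to its number and name.
code_point = "U+{:04X}".format(ord(pi))
name = unicodedata.name(pi)
print(code_point, name)  # U+03C0 GREEK SMALL LETTER PI

# The largest code point, 1114111 in decimal, is 0x10FFFF.
print(sys.maxunicode == 0x10FFFF == 1114111)  # True on Python 3.3 and later
```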
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 10:30 PM, boB Stepp wrote:
> I was looking at http://unicode.org/charts/ Because they called them
> charts, so did I. I'm assuming that despite this organization into
> charts, each and every character in each chart has its own unique
> hexadecimal code to designate each character.

Those are PDF charts (i.e. tables) for Unicode blocks:

https://en.wikipedia.org/wiki/Unicode_block

A Unicode block always has a multiple of 16 codepoints, so it's convenient to represent the ordinal values in hexadecimal.
Re: [Tutor] sort() method and non-ASCII
On 05Feb2017 16:31, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
>> On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
>>> Does the list sort() method (and other sort methods in Python) just go
>>> by the hex value assigned to each symbol to determine sort order in
>>> whichever Unicode encoding chart is being implemented?
>>
>> By default. You need key=locale.strxfrm to make it do anything more
>> sophisticated.
>>
>> I'm not sure what you mean by "whichever unicode encoding chart". Python
>> 3 strings are unicode-unicode, not UTF-8.
>
> As I said in my response to Steve just now: I was looking at
> http://unicode.org/charts/ Because they called them charts, so did I.
> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

You might want to drop this term "hexadecimal"; they're just ordinals (plain old numbers). Though Unicode ordinals are often _written_ in hexadecimal for compactness and because various character groupings are aligned on ranges based on power-of-2 multiples. Like ASCII has the upper case latin alphabet at 64 (2^6) and lower case at 96 (2^6 + 2^5). Those values look rounder in base 16: 0x40 and 0x60.

Cheers,
Cameron Simpson
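To see that the ordinals are plain numbers that merely look rounder in hexadecimal, here is a small sketch using ord(), chr() and hex(). The variable names are only for illustration.

```python
# The same ordinal, printed in decimal and in hexadecimal.
upper_a, lower_a = ord("A"), ord("a")
print(upper_a, hex(upper_a))  # 65 0x41
print(lower_a, hex(lower_a))  # 97 0x61

# Upper and lower case ASCII letters sit exactly 32 (2**5, 0x20) apart.
case_gap = lower_a - upper_a
print(case_gap, hex(case_gap))  # 32 0x20

# The letters themselves start one position after the "round" values:
# 0x40 is '@' and 0x60 is '`'.
block_starts = (chr(0x40), chr(0x60))
print(block_starts)  # ('@', '`')
```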
Re: [Tutor] sort() method and non-ASCII
On Sat, Feb 4, 2017 at 10:50 PM, Random832 wrote:
> On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> By default. You need key=locale.strxfrm to make it do anything more
> sophisticated.
>
> I'm not sure what you mean by "whichever unicode encoding chart". Python
> 3 strings are unicode-unicode, not UTF-8.

As I said in my response to Steve just now: I was looking at http://unicode.org/charts/ Because they called them charts, so did I. I'm assuming that despite this organization into charts, each and every character in each chart has its own unique hexadecimal code to designate each character.

--
boB
Re: [Tutor] sort() method and non-ASCII
On Sun, Feb 5, 2017 at 2:32 AM, Steven D'Aprano wrote:
> On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> Correct, except that there is only one Unicode encoding chart.

I was looking at http://unicode.org/charts/ Because they called them charts, so did I. I'm assuming that despite this organization into charts, each and every character in each chart has its own unique hexadecimal code to designate each character.

--
boB
Re: [Tutor] Syntax error while attempting to type in multiline statements in the interactive interpreter
On 05/02/17 01:29, boB Stepp wrote:
> But it seems to me on further thought that both REPL and what seems
> most consistent to me, "...wait until all the input has been read,
> then evaluate it all..." amounts to the same thing in the case of
> entering function definitions into the interpreter.

Nope, the function definition is a single executable statement. def is a command just like print. The command is to compile the block of code and store it as a function object that can later be called. So the interpreter is being completely consistent in looking for the full definition in that case.

You are right that the interactive interpreter *could* have read all input before executing it, but by design it doesn't. It is an arbitrary choice to make the interpreter act on a single input statement at a time (and personally I think a good one).

The problem with executing multiple statements at a time is that your mistakes are often not where you think they are. By showing you the output of every statement as you go, the interpreter lets you notice the error as it happens, so you don't make mistaken assumptions about which of the previous N statements contains the bug.

So personally I prefer the Python style interpreter. Perl by contrast is more like your preference and interprets to an EOF, and I find that makes it too easy to make mistakes. Best of all I guess is Smalltalk, which executes any code you highlight with a mouse...

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos
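The point that def is itself an executable statement can be illustrated outside the interactive interpreter. In this sketch the function name double and the namespace dictionary are made up for the example; compile() and exec() are standard builtins and are used here only to make the "compile the block, then store it as a function object" step visible.

```python
# Source text for a complete function definition: one compound statement.
source = "def double(x):\n    return 2 * x\n"

namespace = {}
defined_before = "double" in namespace  # nothing has been executed yet

# Compiling does not define the function; it only produces a code object.
code = compile(source, "<example>", "exec")
defined_after_compile = "double" in namespace

# Executing the def statement is what creates the function object and
# binds it to the name 'double'.
exec(code, namespace)
defined_after_exec = "double" in namespace

result = namespace["double"](21)
print(defined_before, defined_after_compile, defined_after_exec, result)
# False False True 42
```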
Re: [Tutor] sort() method and non-ASCII
On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

By default. You need key=locale.strxfrm to make it do anything more sophisticated.

I'm not sure what you mean by "whichever unicode encoding chart". Python 3 strings are unicode-unicode, not UTF-8.
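A minimal sketch of the key=locale.strxfrm suggestion follows. The word list is invented for illustration, and the locale-aware result depends entirely on which collation locale is active on the machine running it, so no particular output is claimed for that part.

```python
import locale

words = ["zebra", "áé", "apple", "Zulu"]

# Ask for the user's default collation locale. On a misconfigured system
# this can fail; in that case the currently active locale is kept.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    pass

# Default sort: compares code points, so 'Z' < 'a' < 'z' < 'á'.
by_code_point = sorted(words)
print(by_code_point)  # ['Zulu', 'apple', 'zebra', 'áé']

# Locale-aware sort: strxfrm transforms each string into a form whose
# ordinary comparison follows the active locale's collation rules.
by_locale = sorted(words, key=locale.strxfrm)
print(by_locale)  # order depends on the active locale
```

In the plain "C" locale the two results are usually the same; in a language-specific locale the accented and mixed-case words typically move closer to where a dictionary would put them.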
Re: [Tutor] sort() method and non-ASCII
On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

Correct, except that there is only one Unicode encoding chart.

You may be thinking of the legacy Windows "code pages" system, where you can change the code page to re-interpret characters as different characters. E.g. ð in code page 1252 (Western European) becomes π in code page 1253 (Greek).

Python supports encoding and decoding to and from legacy code page forms, but Unicode itself does away with the idea of using separate code pages. It effectively is a single, giant code page containing room for over a million characters. It's also a superset of ASCII, so pure ASCII text can be identical in Unicode.

Anyhoo, since Unicode supports dozens of languages from all over the world, it defines "collation rules" for sorting text in various languages. For example, sorting in Austria is different from sorting in Germany, despite them both using the same alphabet. Even in English, sorting rules can vary: some phone books sort Mc and Mac together, some don't.

However, Python doesn't directly support that. It just provides a single basic lexicographic sort based on the ord() of each character in the string.

> If yes, then
> my expectation would be that the French "á" would come after the "z"
> character.

Correct:

py> "á" > "z"
True
py> sorted('áz')
['z', 'á']

--
Steve
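The "lexicographic sort based on the ord() of each character" can be made explicit by sorting on the list of ordinals and comparing that with the default sort. This is only an illustration with an invented word list; it is not how the interpreter implements string comparison internally.

```python
words = ["zoo", "ápice", "Zebra", "apple"]

# Default string sort.
by_default = sorted(words)

# The same ordering, spelled out: compare the lists of code points.
by_ordinals = sorted(words, key=lambda s: [ord(c) for c in s])

print(by_default)                 # ['Zebra', 'apple', 'zoo', 'ápice']
print(by_default == by_ordinals)  # True

# The first characters decide the order here: 90 < 97 < 122 < 225.
first_ordinals = [ord(w[0]) for w in by_default]
print(first_ordinals)  # [90, 97, 122, 225]
```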
Re: [Tutor] Function annotations
On Sat, Feb 04, 2017 at 10:11:39PM -0600, boB Stepp wrote:
> Are the people making linters implementing checking function
> annotations? Or is this something only gradually being adopted?

Depends which linter :-)

MyPy is still the reference implementation for type hinting in Python:

http://mypy-lang.org/

As for pychecker, pylint, jedi, pyflakes, etc you'll have to check with the individual application itself.

> Steve, are you making use of function annotations? If yes, are you
> finding them worth the extra effort?

Not yet.

--
Steve