date:20170205

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Cameron Simpson


On 05Feb2017 22:27, boB Stepp  wrote:

On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano  wrote:

Alternatively, you can embed it right in the string. For code points
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
escapes:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'


Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'


You have surprised me here by using single quotes to enclose the
entire assignment statements.  I thought this would throw a syntax
error, but it works just like you show.  What is going on here?


It's not an assignment statement. It's just a string. He's typing a string 
containing a \N{...} sequence and Python's printing that string back at you; 
pi's a printable character and gets displayed directly.


Try with this:

 py> 'here is a string\n\nline 3'
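A small sketch of the distinction, assuming a plain Python 3 interpreter: the interactive prompt echoes the repr() of a bare expression, while print() writes the characters themselves. The names s and echoed are just for illustration.

```python
s = 'here is a string\n\nline 3'

# The interactive prompt shows repr(s) for a bare expression:
# quotes included, newlines shown as \n escapes.
echoed = repr(s)
print(echoed)

# print() writes the actual characters, so the newlines take effect.
print(s)
```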


One last comment: Random832 said:
"Python 3 strings are unicode-unicode, not UTF-8."


If I recall what I originally wrote (and intended) I was merely
indicating I was happy with Python 3's default UTF-8 encoding.  I do
not know enough to know what these other UTF encodings offer.


From the outside (i.e. to your code) Python 3 strings are sequences of Unicode 
code points (characters, near enough). How they're _stored_ internally is not 
your problem:-) When you write a string to a file or the terminal etc, the 
string needs to be _encoded_ into a sequence of bytes (a sequence of bytes 
because there are more Unicode code points than can be expressed with one 
byte).
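To make that concrete, here is a minimal sketch using str.encode() and bytes.decode(); the variable names are illustrative only.

```python
text = 'pi = π'

# Encoding turns a str (code points) into bytes.
data = text.encode('utf-8')

# Six code points, but seven bytes: π needs two bytes in UTF-8.
print(len(text), len(data))

# Decoding with the same encoding recovers the original string.
assert data.decode('utf-8') == text
```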


UTF-8 is by far the commonest such encoding in use. It has several nice 
characteristics: for one, the ASCII code points _are_ stored in a single byte.  
While that's nice for Western almost-only-speaking-English folks like me, it 
also means that the zillions of existing ASCII text files don't need to be 
recoded to work in UTF-8. It has other cool features too.
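A quick way to check the ASCII compatibility claim for yourself (a sketch; the sample strings are made up):

```python
ascii_text = 'plain old ASCII'

# Pure ASCII text produces byte-for-byte identical output in both encodings.
assert ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# So bytes written as ASCII can be read back as UTF-8 unchanged.
legacy = b'written long before UTF-8'
decoded = legacy.decode('utf-8')
print(decoded)
```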



To be pedantic, Unicode strings are sequences of abstract code points
("characters"). UTF-8 is a particular concrete implementation that is
used to store or transmit such code strings. Here are examples of three
possible encoding forms for the string 'πz':

UTF-16: either two, or four, bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A


I have not tallied up how many code points are actually assigned to
characters.  Does UTF-8 encoding currently cover all of them?  If yes,
why is there a need for other encodings?  Or by saying:


UTF-8 is variable length. You can leap into the middle of a UTF-8 string and 
resync (== find the first byte of the next character) thanks to its neat coding 
design, but you can't "seek" directly to the position of an arbitrarily 
numbered character (eg go to character 102345). By contrast, UTF-32 is fixed 
length.
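Here is a rough sketch of both ideas. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx; the helper name next_char_start is invented for this example.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos."""
    # Continuation bytes look like 0b10xxxxxx; skip over them.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

data = 'aπz'.encode('utf-8')          # b'a\xcf\x80z'
resynced = next_char_start(data, 2)   # index 2 is the middle of π
print(resynced)

# With UTF-32 every character is four bytes, so character n
# starts at byte offset 4 * n and can be sliced out directly.
wide = 'aπz'.encode('utf-32-be')
n = 1
second = wide[4 * n:4 * n + 4].decode('utf-32-be')
print(second)
```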



(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
reversed, e.g. C003 7A00. UTF-8 is not.)


do you mean that some hardware configurations require UTF-16 or UTF-32?


No, different machines order the bytes in a larger word in different orders.  
"Big endian" machines like SPARCs and M68k etc put the most significant bytes 
first; little endian machines put the least significant bytes first (eg Intel 
architecture machines). (Aside: the Alpha was switchable.)


So the "natural" way to write UTF-16 or UTF-32 might be big or little endian, 
and you need to know what was chosen for a given file.
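As a sketch of what that looks like from Python: the codec names 'utf-16-be' and 'utf-16-le' pick the byte order explicitly, and plain 'utf-16' on decode looks for a byte order mark (BOM) at the start of the data. Variable names are illustrative.

```python
import codecs

s = 'πz'
be = s.encode('utf-16-be')   # most significant byte first
le = s.encode('utf-16-le')   # least significant byte first
print(be.hex(), le.hex())

# A byte order mark at the front records which order was chosen,
# so the generic 'utf-16' codec can decode it correctly.
with_bom = codecs.BOM_UTF16_LE + le
assert with_bom.decode('utf-16') == s

# Guessing the wrong order does not fail here; it silently yields
# two different, unrelated characters.
wrong = le.decode('utf-16-be')
print(wrong == s)
```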


Cheers,
Cameron Simpson 
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp

On Sun, Feb 5, 2017 at 7:23 PM, Steven D'Aprano  wrote:
> On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
>> On Sat, Feb 4, 2017 at 10:50 PM, Random832  wrote:
>> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:

> Alternatively, you can embed it right in the string. For code points
> between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U
> escapes:
>
> py> 'pi = \u03C0'  # requires exactly four hex digits
> 'pi = π'
>
> py> 'pi = \U000003C0'  # requires exactly eight hex digits
> 'pi = π'
>
>
> Lastly, you can use the code point's name:
>
> py> 'pi = \N{GREEK SMALL LETTER PI}'
> 'pi = π'

You have surprised me here by using single quotes to enclose the
entire assignment statements.  I thought this would throw a syntax
error, but it works just like you show.  What is going on here?

>
> One last comment: Random832 said:
>
> "Python 3 strings are unicode-unicode, not UTF-8."

If I recall what I originally wrote (and intended) I was merely
indicating I was happy with Python 3's default UTF-8 encoding.  I do
not know enough to know what these other UTF encodings offer.

> To be pedantic, Unicode strings are sequences of abstract code points
> ("characters"). UTF-8 is a particular concrete implementation that is
> used to store or transmit such code strings. Here are examples of three
> possible encoding forms for the string 'πz':
>
> UTF-16: either two, or four, bytes per character: 03C0 007A
>
> UTF-32: exactly four bytes per character: 000003C0 0000007A
>
> UTF-8: between one and four bytes per character: CF80 7A

I have not tallied up how many code points are actually assigned to
characters.  Does UTF-8 encoding currently cover all of them?  If yes,
why is there a need for other encodings?  Or by saying:

> (UTF-16 and UTF-32 are hardware-dependent, and the byte order could be
> reversed, e.g. C003 7A00. UTF-8 is not.)

do you mean that some hardware configurations require UTF-16 or UTF-32?

Thank you (and the others in this thread) for taking the time to
clarify these matters.

-- 
boB

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp

On Sun, Feb 5, 2017 at 5:25 PM, Cameron Simpson  wrote:

> You might want to drop this term "hexadecimal"; they're just ordinals (plain
> old numbers). Though Unicode ordinals are often _written_ in hexadecimal for
> compactness and because various character groupings are aligned on ranges
> based on power-of-2 multiples. Like ASCII has the upper case latin alphabet
> at 64 (2^6) and lower case at 96 (2^6 + 2^5). Those values look rounder in
> base 16: 0x40 and 0x60.

I will endeavor to use "code points" instead.  I am just used to
seeing these charts/tables in hexadecimal values.



-- 
boB

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Steven D'Aprano

On Sun, Feb 05, 2017 at 04:31:43PM -0600, boB Stepp wrote:
> On Sat, Feb 4, 2017 at 10:50 PM, Random832  wrote:
> > On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> >> Does the list sort() method (and other sort methods in Python) just go
> >> by the hex value assigned to each symbol to determine sort order in
> >> whichever Unicode encoding chart is being implemented?
> >
> > By default. You need key=locale.strxfrm to make it do anything more
> > sophisticated.
> >
> > I'm not sure what you mean by "whichever unicode encoding chart". Python
> > 3 strings are unicode-unicode, not UTF-8.
> 
> As I said in my response to Steve just now:  I was looking at
> http://unicode.org/charts/  Because they called them charts, so did I.

Ah, that makes sense! They're just reference tables ("charts") for the 
convenience of people wishing to find particular characters.

> I'm assuming that despite this organization into charts, each and
> every character in each chart has its own unique hexadecimal code to
> designate each character.

Correct, although strictly speaking the codes are only conventionally 
given in hexadecimal. They are numbered from 0 to 1114111 in 
decimal (although not all codes are currently used).

The terminology used is that a "code point" is what I've been calling a 
"character", although not all code points are characters. Code points 
are usually written either as the character itself, e.g. 'A', or using 
the notation U+0041 where there are at least four and no more than six 
hexadecimal digits following the "U+". 
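For illustration, the U+ notation can be produced from any character with ord() and a format spec. This is only a sketch, and the helper name u_plus is made up here.

```python
def u_plus(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # 04X pads to at least four uppercase hex digits; larger code
    # points simply use five or six digits.
    return f'U+{ord(ch):04X}'

for ch in ('A', 'π', '\U0001F600'):
    print(u_plus(ch))
```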

Bringing this back to Python, if you know the code point (as a number), 
you can use the chr() function to return it as a string:

py> chr(960)
'π'

Don't forget that Python understands hex too!

py> chr(0x03C0)  # better than chr(int('03C0', 16))
'π'

Alternatively, you can embed it right in the string. For code points 
between U+0000 and U+FFFF, use the \u escape, and for the rest, use \U 
escapes:

py> 'pi = \u03C0'  # requires exactly four hex digits
'pi = π'

py> 'pi = \U000003C0'  # requires exactly eight hex digits
'pi = π'

Lastly, you can use the code point's name:

py> 'pi = \N{GREEK SMALL LETTER PI}'
'pi = π'
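As a quick cross-check, all of these spellings produce the very same one-character string, and unicodedata.name() goes the other way, from character to name:

```python
import unicodedata

forms = [
    chr(0x03C0),                    # from the ordinal
    '\u03C0',                       # four hex digits
    '\U000003C0',                   # eight hex digits
    '\N{GREEK SMALL LETTER PI}',    # by name
]

# Every form is the same string.
assert all(f == 'π' for f in forms)

name = unicodedata.name('π')
print(name)
```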

One last comment: Random832 said:

"Python 3 strings are unicode-unicode, not UTF-8."

To be pedantic, Unicode strings are sequences of abstract code points 
("characters"). UTF-8 is a particular concrete implementation that is 
used to store or transmit such code strings. Here are examples of three 
possible encoding forms for the string 'πz':

UTF-16: either two, or four, bytes per character: 03C0 007A

UTF-32: exactly four bytes per character: 000003C0 0000007A

UTF-8: between one and four bytes per character: CF80 7A

(UTF-16 and UTF-32 are hardware-dependent, and the byte order could be 
reversed, e.g. C003 7A00. UTF-8 is not.)
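These byte sequences can be reproduced with str.encode(). The sketch below uses the explicit big-endian codec names so that no byte order mark is added; the dictionary name encoded is just for this example.

```python
s = 'πz'

# Hex dump of the same two-character string in three encodings.
encoded = {codec: s.encode(codec).hex()
           for codec in ('utf-16-be', 'utf-32-be', 'utf-8')}

for codec, hexbytes in encoded.items():
    print(codec, hexbytes)
```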

Prior to version 3.3, there was a build-time option to select either 
"narrow" or "wide" Unicode strings. A narrow build used a fixed two 
bytes per code point, together with an incomplete and not quite correct 
scheme for using two code points together to represent the supplementary 
Unicode characters U+10000 through U+10FFFF. (This is sometimes called 
UCS-2, sometimes UTF-16, but strictly speaking it is neither, or at 
least an incomplete and "buggy" implementation of UTF-16.)
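On Python 3.3 or later (an assumption of this sketch), a supplementary character is a single code point as far as your code is concerned, and the two-unit "surrogate pair" only shows up when you encode to UTF-16:

```python
import sys

ch = '\U0001F600'   # a code point above U+FFFF

# One code point, so length 1 on Python 3.3+.
print(len(ch))

# Encoded as UTF-16 it becomes two 16-bit units (a surrogate pair).
units = ch.encode('utf-16-be')
print(units.hex())

# The largest code point the interpreter supports.
print(hex(sys.maxunicode))
```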

-- 
Steve

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread eryk sun

On Sun, Feb 5, 2017 at 10:30 PM, boB Stepp  wrote:
> I was looking at http://unicode.org/charts/  Because they called them
> charts, so did I.  I'm assuming that despite this organization into
> charts, each and every character in each chart has its own unique
> hexadecimal code to designate each character.

Those are PDF charts (i.e. tables) for Unicode blocks:

https://en.wikipedia.org/wiki/Unicode_block

A Unicode block always has a multiple of 16 codepoints, so it's
convenient to represent the ordinal values in hexadecimal.

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Cameron Simpson


On 05Feb2017 16:31, boB Stepp  wrote:

On Sat, Feb 4, 2017 at 10:50 PM, Random832  wrote:

On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:

Does the list sort() method (and other sort methods in Python) just go
by the hex value assigned to each symbol to determine sort order in
whichever Unicode encoding chart is being implemented?


By default. You need key=locale.strxfrm to make it do anything more
sophisticated.

I'm not sure what you mean by "whichever unicode encoding chart". Python
3 strings are unicode-unicode, not UTF-8.


As I said in my response to Steve just now:  I was looking at
http://unicode.org/charts/  Because they called them charts, so did I.
I'm assuming that despite this organization into charts, each and
every character in each chart has its own unique hexadecimal code to
designate each character.


You might want to drop this term "hexadecimal"; they're just ordinals (plain 
old numbers). Though Unicode ordinals are often _written_ in hexadecimal for 
compactness and because various character groupings are aligned on ranges based 
on power-of-2 multiples. Like ASCII has the upper case latin alphabet at 64 
(2^6) and lower case at 96 (2^6 + 2^5). Those values look rounder in base 16: 
0x40 and 0x60.
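You can see this with ord() and hex(). Note that the letters themselves start one past those round numbers ('A' is 65, 'a' is 97); the round values belong to '@' and '`'.

```python
# The same ordinals, shown in decimal and in hexadecimal.
for ch in '@A`a':
    print(repr(ch), ord(ch), hex(ord(ch)))

assert ord('@') == 64 == 2**6
assert ord('`') == 96 == 2**6 + 2**5

# Upper and lower case letters differ by exactly 32 (0x20).
case_gap = ord('a') - ord('A')
print(case_gap)
```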


Cheers,
Cameron Simpson 

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp

On Sat, Feb 4, 2017 at 10:50 PM, Random832  wrote:
> On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> By default. You need key=locale.strxfrm to make it do anything more
> sophisticated.
>
> I'm not sure what you mean by "whichever unicode encoding chart". Python
> 3 strings are unicode-unicode, not UTF-8.

As I said in my response to Steve just now:  I was looking at
http://unicode.org/charts/  Because they called them charts, so did I.
I'm assuming that despite this organization into charts, each and
every character in each chart has its own unique hexadecimal code to
designate each character.

-- 
boB

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread boB Stepp

On Sun, Feb 5, 2017 at 2:32 AM, Steven D'Aprano  wrote:
> On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
>> Does the list sort() method (and other sort methods in Python) just go
>> by the hex value assigned to each symbol to determine sort order in
>> whichever Unicode encoding chart is being implemented?
>
> Correct, except that there is only one Unicode encoding chart.

I was looking at http://unicode.org/charts/  Because they called them
charts, so did I.  I'm assuming that despite this organization into
charts, each and every character in each chart has its own unique
hexadecimal code to designate each character.

-- 
boB

Re: [Tutor] Syntax error while attempting to type in multiline statements in the interactive interpreter

2017-02-05 Thread Alan Gauld via Tutor

On 05/02/17 01:29, boB Stepp wrote:

> But it seems to me on further thought that both REPL and what seems
> most consistent to me, "...wait until all the input has been read,
> then evaluate it all..." amounts to the same thing in the case of
> entering function definitions into the interpreter.  

Nope, the function definition is a single executable statement.
def is a command just like print. The command is to compile the
block of code and store it as a function object that can later
be called. So the interpreter is being completely consistent
in looking for the full definition in that case.
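A small sketch of that point: nothing called double exists until the def statement has actually been executed. The use of exec() and the names namespace and double are just devices for this illustration.

```python
namespace = {}
source = "def double(x):\n    return 2 * x\n"

# Before the def statement runs, the name is not bound to anything.
existed_before = 'double' in namespace

# Executing the def compiles the body and stores a function object
# under the name 'double'.
exec(source, namespace)

existed_after = 'double' in namespace
result = namespace['double'](21)
print(existed_before, existed_after, result)
```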

You are right that the interactive interpreter *could* have read
all input before executing it, but by design it doesn't. It is
an arbitrary choice to make the interpreter act on a single
input statement at a time (and personally I think a good one).

The problem with executing multiple statements at a time is
that your mistakes are often not where you think they are.
And by showing you the output of every statement as you go
you notice the error as it happens and don't mistakenly make
assumptions about which of the previous N statements
contains the bug. So personally I prefer the Python style
interpreter. Perl by contrast is more like your preference
and interprets to an EOF, and I find that makes it too easy
to make mistakes.

Best of all I guess is Smalltalk which executes any code
you highlight with a mouse...

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Random832

On Sat, Feb 4, 2017, at 22:52, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

By default. You need key=locale.strxfrm to make it do anything more
sophisticated.
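A hedged sketch of how that looks in practice. The word list is invented, and the locale-aware result depends entirely on which locales are installed and selected on your system, so no particular order is claimed for it.

```python
import locale

words = ['zebra', 'Äpfel', 'apple', 'Zoo']

# Use the user's default collation rules; if the environment names a
# locale that is not installed, just stay with the current one.
try:
    locale.setlocale(locale.LC_COLLATE, '')
except locale.Error:
    pass

# Default sort: plain code point order.
by_code_point = sorted(words)

# Locale-aware sort: strxfrm transforms each string into a sort key.
by_locale = sorted(words, key=locale.strxfrm)

print(by_code_point)
print(by_locale)
```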

I'm not sure what you mean by "whichever unicode encoding chart". Python
3 strings are unicode-unicode, not UTF-8.

Re: [Tutor] sort() method and non-ASCII

2017-02-05 Thread Steven D'Aprano

On Sat, Feb 04, 2017 at 09:52:47PM -0600, boB Stepp wrote:
> Does the list sort() method (and other sort methods in Python) just go
> by the hex value assigned to each symbol to determine sort order in
> whichever Unicode encoding chart is being implemented?

Correct, except that there is only one Unicode encoding chart.

You may be thinking of the legacy Windows "code pages" system, where you 
can change the code page to re-interpret characters as different 
characters. E.g. ð in code page 1252 (Western European) becomes π in 
code page 1253 (Greek).
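The code page effect is easy to reproduce with Python's 'cp1252' and 'cp1253' codecs: the same single byte decodes to a different character depending on which code page you assume.

```python
raw = b'\xf0'

# One byte, two interpretations.
western = raw.decode('cp1252')
greek = raw.decode('cp1253')
print(western, greek)

# In Unicode the two characters have distinct code points.
print(hex(ord(western)), hex(ord(greek)))
```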

Python supports encoding and decoding to and from legacy code page 
forms, but Unicode itself does away with the idea of using separate code 
pages. It effectively is a single, giant code page containing room for 
over a million characters. It's also a superset of ASCII, so pure ASCII 
text can be identical in Unicode.

Anyhoo, since Unicode supports dozens of languages from all over the 
world, it defines "collation rules" for sorting text in various 
languages. For example, sorting in Austria is different from sorting in 
Germany, despite them both using the same alphabet. Even in English, 
sorting rules can vary: some phone books sort Mc and Mac together, some 
don't.

However, Python doesn't directly support that. It just provides a single 
basic lexicographic sort based on the ord() of each character in the 
string.

> If yes, then
> my expectation would be that the French "á" would come after the "z"
> character. 

Correct:

py> "á" > "z"
True
py> sorted('áz')
['z', 'á']
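If you want accented letters to sort next to their base letters, one crude workaround is a sort key that strips combining marks after Unicode decomposition. This is only a sketch and not real language-aware collation; the helper name strip_accents is made up for this example.

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Remove combining marks, e.g. 'á' becomes 'a'."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

letters = ['z', 'á', 'a']

# Default: ordered by ord() of each character.
plain = sorted(letters)

# Sort on the unaccented form first, the original as a tie-breaker.
folded = sorted(letters, key=lambda s: (strip_accents(s), s))

print(plain)
print(folded)
```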

-- 
Steve


Re: [Tutor] Function annotations

2017-02-05 Thread Steven D'Aprano

On Sat, Feb 04, 2017 at 10:11:39PM -0600, boB Stepp wrote:

> Are the people making linters implementing checking function
> annotations?  Or is this something only gradually being adopted?

Depends which linter :-)

MyPy is still the reference implementation for type hinting in Python:

http://mypy-lang.org/

As for pychecker, pylint, jedi, pyflakes, etc you'll have to check with 
the individual application itself.
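For readers who have not seen them, here is a minimal sketch of what function annotations look like. The function average is invented for this example; Python itself stores the annotations but does not enforce them, and a separate checker such as MyPy is what reads them.

```python
def average(values: list, default: float = 0.0) -> float:
    """Return the mean of values, or default if values is empty."""
    return sum(values) / len(values) if values else default

# The annotations are just stored on the function object.
print(average.__annotations__)

print(average([1, 2, 3]))
print(average([]))
```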

> Steve, are you making use of function annotations?  If yes, are you
> finding them worth the extra effort?

Not yet.


-- 
Steve
