python 3.3 repr

2013-11-15 Thread Robin Becker

I'm trying to understand what's going on with this simple program

if __name__=='__main__':
    print("repr=%s" % repr(u'\xc1'))
    print("%%r=%r" % u'\xc1')

On my windows XP box this fails miserably if run directly at a terminal

C:\tmp> \Python33\python.exe bang.py
Traceback (most recent call last):
  File "bang.py", line 2, in <module>
    print("repr=%s" % repr(u'\xc1'))
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6: 
character maps to <undefined>


If I run the program redirected into a file then no error occurs and the 
result looks like this


C:\tmp>cat fff
repr='┴'
%r='┴'

and if I run it into a pipe it works as though into a file.

It seems that repr thinks it can render u'\xc1' directly, which is a problem 
since print then seems to want to convert that to cp437 if directed into a terminal.


I find the idea that print knows what it's printing to a bit dangerous, but it's 
the repr behaviour that strikes me as bad.


What is responsible for defining the repr function's 'printable' so that repr 
would give me, say, an ASCII rendering?

-confused-ly yrs-
Robin Becker

--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Ned Batchelder
On Friday, November 15, 2013 6:28:15 AM UTC-5, Robin Becker wrote:
 I'm trying to understand what's going on with this simple program
 
 if __name__=='__main__':
   print("repr=%s" % repr(u'\xc1'))
   print("%%r=%r" % u'\xc1')
 
 On my windows XP box this fails miserably if run directly at a terminal
 
 C:\tmp> \Python33\python.exe bang.py
 Traceback (most recent call last):
    File "bang.py", line 2, in <module>
      print("repr=%s" % repr(u'\xc1'))
    File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
 UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 
 6: character maps to <undefined>
 
 If I run the program redirected into a file then no error occurs and the 
 result looks like this
 
 C:\tmp>cat fff
 repr='┴'
 %r='┴'
 
 and if I run it into a pipe it works as though into a file.
 
 It seems that repr thinks it can render u'\xc1' directly which is a problem 
 since print then seems to want to convert that to cp437 if directed into a 
 terminal.
 
 I find the idea that print knows what it's printing to a bit dangerous, but 
 it's the repr behaviour that strikes me as bad.
 
 What is responsible for defining the repr function's 'printable' so that repr 
 would give me, say, an ASCII rendering?
 -confused-ly yrs-
 Robin Becker

In Python3, repr() will return a Unicode string, and will preserve existing 
Unicode characters in its arguments.  This has been controversial.  To get the 
Python 2 behavior of a pure-ascii representation, there is the new builtin 
ascii(), and a corresponding %a format string.
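For instance, at a Python 3.3 prompt whose console can encode the output (a 
UTF-8 terminal, say -- on a cp437 console the print itself raises the very 
error under discussion):

print(repr(u'\xc1'))     # 'Á'    -- repr() keeps the non-ASCII character
print(ascii(u'\xc1'))    # '\xc1' -- ascii() escapes it
print("%r vs %a" % (u'\xc1', u'\xc1'))   # 'Á' vs '\xc1'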

--Ned.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Robin Becker

On 15/11/2013 11:38, Ned Batchelder wrote:
..


In Python3, repr() will return a Unicode string, and will preserve existing 
Unicode characters in its arguments.  This has been controversial.  To get the 
Python 2 behavior of a pure-ascii representation, there is the new builtin 
ascii(), and a corresponding %a format string.

--Ned.



thanks for this, but it doesn't make the split across python 2 - 3 any easier.
--
Robin Becker
--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Ned Batchelder
On Friday, November 15, 2013 7:16:52 AM UTC-5, Robin Becker wrote:
 On 15/11/2013 11:38, Ned Batchelder wrote:
 ..
 
  In Python3, repr() will return a Unicode string, and will preserve existing 
  Unicode characters in its arguments.  This has been controversial.  To get 
  the Python 2 behavior of a pure-ascii representation, there is the new 
  builtin ascii(), and a corresponding %a format string.
 
  --Ned.
 
 
 thanks for this, but it doesn't make the split across python 2 - 3 any easier.
 -- 
 Robin Becker

No, but I've found that significant programs that run on both 2 and 3 need to 
have some shims to make the code work anyway.  You could do this:

try:
    repr = ascii
except NameError:
    pass

and then use repr throughout.
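For example, with the shim in place the OP's first line becomes safe on both 
interpreters (a sketch, assuming the shim runs before any repr() calls):

print("repr=%s" % repr(u'\xc1'))   # repr=u'\xc1' on Python 2, repr='\xc1' on Python 3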

--Ned.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Roy Smith
In article <b6db8982-feac-4036-8ec4-2dc720d41...@googlegroups.com>,
Ned Batchelder <n...@nedbatchelder.com> wrote:

 In Python3, repr() will return a Unicode string, and will preserve existing 
 Unicode characters in its arguments.  This has been controversial.  To get 
 the Python 2 behavior of a pure-ascii representation, there is the new 
 builtin ascii(), and a corresponding %a format string.

I'm still stuck on Python 2, and while I can understand the controversy ("It 
breaks my Python 2 code!"), this seems like the right thing to have done.  In 
Python 2, unicode is an add-on.  One of the big design drivers in Python 3 was 
to make unicode the standard.

The idea behind repr() is to provide a "just plain text" representation of an 
object.  In P2, "just plain text" means ascii, so escaping non-ascii characters 
makes sense.  In P3, "just plain text" means unicode, so escaping non-ascii 
characters no longer makes sense.

Some of us have been doing this long enough to remember when "just plain text" 
meant only a single case of the alphabet (and a subset of ascii punctuation).  
On an ASR-33, your C program would print like:

MAIN() \(
PRINTF("HELLO, ASCII WORLD");
\)

because ASR-33's didn't have curly braces (or lower case).

Having P3's repr() escape non-ascii characters today makes about as much sense 
as expecting P2's repr() to escape curly braces (and vertical bars, and a few 
others) because not every terminal can print those.

--
Roy Smith
r...@panix.com

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Robin Becker

On 15/11/2013 13:54, Ned Batchelder wrote:
.


No, but I've found that significant programs that run on both 2 and 3 need to 
have some shims to make the code work anyway.  You could do this:

 try:
     repr = ascii
 except NameError:
     pass


yes I tried that, but it doesn't affect %r, which is inlined in unicodeobject.c; 
for me it seems easier to fix windows to use something like a standard encoding 
of utf8, i.e. cp65001, but that's quite hard to do globally. It seems sitecustomize 
is too late to set os.environ['PYTHONIOENCODING']; perhaps I can stuff that into 
one of the global environment vars and have it work for all python invocations.
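A sketch of why sitecustomize comes too late: PYTHONIOENCODING is only 
consulted at interpreter startup, so it has to be in the environment of a 
fresh process (values here are illustrative):

import os, subprocess, sys

# spawn a new interpreter with PYTHONIOENCODING set in its environment
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c', "print(repr(u'\\xc1'))"], env=env)
print(out)   # b"'\xc3\x81'\r\n" on windows: UTF-8 bytes, no UnicodeEncodeError

Setting the same variable from within an already-running process changes 
nothing for that process's own sys.stdout.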

--
Robin Becker

--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Serhiy Storchaka

On 15.11.13 15:54, Ned Batchelder wrote:

No, but I've found that significant programs that run on both 2 and 3 need to 
have some shims to make the code work anyway.  You could do this:

 try:
     repr = ascii
 except NameError:
     pass

and then use repr throughout.


Or rather

try:
    ascii
except NameError:
    ascii = repr

and then use ascii throughout.


--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Robin Becker

..

I'm still stuck on Python 2, and while I can understand the controversy ("It breaks 
my Python 2 code!"), this seems like the right thing to have done.  In Python 2, 
unicode is an add-on.  One of the big design drivers in Python 3 was to make unicode the 
standard.

The idea behind repr() is to provide a "just plain text" representation of an object.  In P2, 
"just plain text" means ascii, so escaping non-ascii characters makes sense.  In P3, "just 
plain text" means unicode, so escaping non-ascii characters no longer makes sense.



unfortunately the word 'printable' got into the definition of repr; it's clear 
that printability is not the same as unicode, at least as far as the print 
function is concerned. In my opinion it would have been better to keep the old 
behaviour, as that would have eased compatibility.


The python gods don't count that sort of thing as important enough, so we get the 
mess that is the python2/3 split. ReportLab has to do both, so it's a real issue; 
in addition, swapping the str/unicode pair to bytes/str doesn't help one's 
mental models either :(


Things went wrong when utf8 was not adopted as the standard encoding, thus 
requiring two string types; it would have been easier to have a len function to 
count bytes as before and a glyphlen to count glyphs. Now as I understand it we 
have a complicated mess under the hood for unicode objects, so they have a 
variable representation to approximate an 8 bit representation when suitable etc 
etc etc.



Some of us have been doing this long enough to remember when "just plain text" 
meant only a single case of the alphabet (and a subset of ascii punctuation).  On an 
ASR-33, your C program would print like:

MAIN() \(
PRINTF("HELLO, ASCII WORLD");
\)

because ASR-33's didn't have curly braces (or lower case).

Having P3's repr() escape non-ascii characters today makes about as much sense 
as expecting P2's repr() to escape curly braces (and vertical bars, and a few 
others) because not every terminal can print those.


.
I can certainly remember those days, how we cried and laughed when 8 bits became 
popular.

--
Robin Becker

--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Joel Goldstick
 Some of us have been doing this long enough to remember when "just plain
 text" meant only a single case of the alphabet (and a subset of ascii
 punctuation).  On an ASR-33, your C program would print like:

 MAIN() \(
 PRINTF("HELLO, ASCII WORLD");
 \)

 because ASR-33's didn't have curly braces (or lower case).

 Having P3's repr() escape non-ascii characters today makes about as much
 sense as expecting P2's repr() to escape curly braces (and vertical bars,
 and a few others) because not every terminal can print those.

 .
 I can certainly remember those days, how we cried and laughed when 8 bits
 became popular.

Really? You cried and laughed over 7 vs. 8 bits?  That's lovely (?).
;).  That eighth bit sure was less confusing than codepoint
translations...


 --
 Robin Becker
 --
 https://mail.python.org/mailman/listinfo/python-list



-- 
Joel Goldstick
http://joelgoldstick.com
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Robin Becker

On 15/11/2013 14:40, Serhiy Storchaka wrote:
..



and then use repr throughout.


Or rather

 try:
     ascii
 except NameError:
     ascii = repr

and then use ascii throughout.




apparently you can import ascii from future_builtins, and the print() function is 
available via


from __future__ import print_function

nothing fixes all those %r formats to be %a though :(
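A sketch of the portable spelling (future_builtins exists on Python 2.6+ only, 
and %a exists on Python 3 only, so %s plus ascii() is the form that runs on both):

try:
    from future_builtins import ascii   # Python 2
except ImportError:
    pass                                # Python 3: ascii() is a builtin

print("v=%s" % ascii(u'\xc1'))   # v=u'\xc1' on Python 2, v='\xc1' on Python 3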
--
Robin Becker

--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Robin Becker

...

became popular.


Really? you cried and laughed over 7 vs. 8 bits?  That's lovely (?).
;).  That eighth bit sure was less confusing than codepoint
translations



no we had 6 bits in 60 bit words as I recall; extracting the nth character 
involved division by 6; smart people did tricks with inverted multiplications 
etc etc  :(

--
Robin Becker
--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Joel Goldstick
On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker ro...@reportlab.com wrote:
 ...

 became popular.

 Really? you cried and laughed over 7 vs. 8 bits?  That's lovely (?).
 ;).  That eighth bit sure was less confusing than codepoint
 translations



 no we had 6 bits in 60 bit words as I recall; extracting the nth character
 involved division by 6; smart people did tricks with inverted
 multiplications etc etc  :(
 --

Cool, someone here is older than me!  I came in with the 8080, and I
remember split octal, but sixes are something I missed out on.
 Robin Becker



-- 
Joel Goldstick
http://joelgoldstick.com
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Ned Batchelder
On Friday, November 15, 2013 9:43:17 AM UTC-5, Robin Becker wrote:
 Things went wrong when utf8 was not adopted as the standard encoding thus 
 requiring two string types, it would have been easier to have a len function 
 to count bytes as before and a glyphlen to count glyphs. Now as I understand 
 it we have a complicated mess under the hood for unicode objects so they have 
 a variable representation to approximate an 8 bit representation when 
 suitable etc etc etc.
 

Dealing with bytes and Unicode is complicated, and the 2-3 transition is not 
easy, but let's please not spread the misunderstanding that somehow the 
Flexible String Representation is at fault.  However you store Unicode code 
points, they are different than bytes, and it is complex having to deal with 
both.  You can't somehow make the dichotomy go away, you can only choose where 
you want to think about it.

--Ned.

 -- 
 Robin Becker

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Chris Angelico
On Sat, Nov 16, 2013 at 1:43 AM, Robin Becker ro...@reportlab.com wrote:
 ..

 I'm still stuck on Python 2, and while I can understand the controversy
 (It breaks my Python 2 code!), this seems like the right thing to have
 done.  In Python 2, unicode is an add-on.  One of the big design drivers in
 Python 3 was to make unicode the standard.

 The idea behind repr() is to provide a just plain text representation of
 an object.  In P2, just plain text means ascii, so escaping non-ascii
 characters makes sense.  In P3, just plain text means unicode, so escaping
 non-ascii characters no longer makes sense.


 unfortunately the word 'printable' got into the definition of repr; it's
 clear that printability is not the same as unicode at least as far as the
 print function is concerned. In my opinion it would have been better to
 leave the old behaviour as that would have eased the compatibility.

Printable means many different things in different contexts. In some
contexts, the sequence \x66\x75\x63\x6b is considered unprintable, yet
each of those characters is perfectly displayable in its natural form.
Under IDLE, non-BMP characters can't be displayed (or at least, that's
how it has been; I haven't checked current status on that one). On
Windows, the console runs in codepage 437 by default (again, I may be
wrong here), so anything not representable in that has to be escaped.
My Linux box has its console set to full Unicode, everything working
perfectly, so any non-control character can be printed. As far as
Python's concerned, all of that is "outside" - something is printable
if it's printable within Unicode, and the other hassles are matters of
encoding. (Except the first one. I don't think there's an encoding
"g-rated".)

 The python gods don't count that sort of thing as important enough so we get
 the mess that is the python2/3 split. ReportLab has to do both so it's a
 real issue; in addition swapping the str - unicode pair to bytes str doesn't
 help one's mental models either :(

That's fixing, in effect, a long-standing bug - of a sort. The name
"str" needs to be applied to the most normal string type. As of Python
3, that's a Unicode string, which is as it should be. In Python 2, it
was the ASCII/bytes string, which still fit the description of "most
normal string type", but that means that Python 2 programs are
Unicode-unaware by default, which is a flaw. Hence the Py3 fix.

 Things went wrong when utf8 was not adopted as the standard encoding thus
 requiring two string types, it would have been easier to have a len function
 to count bytes as before and a glyphlen to count glyphs. Now as I understand
 it we have a complicated mess under the hood for unicode objects so they
 have a variable representation to approximate an 8 bit representation when
 suitable etc etc etc.

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

There are languages that do what you describe. It's very VERY easy to
break stuff. What happens when you slice a string?

>>> foo = "asdf"
>>> foo[:2], foo[2:]
('as', 'df')

>>> foo = "q\u1234zy"
>>> foo[:2], foo[2:]
('qሴ', 'zy')

Looks good to me. I split a four-character string, I get two
two-character strings. If that had been done in UTF-8, either I would
need to know "don't split at that boundary, that's between bytes in a
character", or else the indexing and slicing would have to be done by
counting characters from the beginning of the string - an O(n)
operation, rather than an O(1) pointer arithmetic, not to mention that
it'll blow your CPU cache (touching every part of a potentially-long
string) just to find the position.
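To make the boundary hazard concrete, here's the same string as UTF-8 bytes, 
split down the middle (an illustrative sketch):

s = u'q\u1234zy'.encode('utf-8')    # b'q\xe1\x88\xb4zy': 6 bytes, 4 characters
left, right = s[:3], s[3:]          # byte midpoint, not a character boundary
print(left.decode('utf-8', 'replace'))    # 'q\ufffd'  -- a truncated character
print(right.decode('utf-8', 'replace'))   # '\ufffdzy' -- a stray trail byte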

The only reliable way to manage things is to work with true Unicode.
You can completely ignore the internal CPython representation; what
matters is that Python (any implementation, as long as it conforms
with version 3.3 or later) lets you index Unicode codepoints out of a
Unicode string, without differentiating between those that happen to
be ASCII, those that fit in a single byte, those that fit in two
bytes, and those that are flagged RTL, because none of those
considerations makes any difference to you.

It takes some getting your head around, but it's worth it - same as
using git instead of a Windows shared drive. (I'm still trying to push
my family to think git.)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Robin Becker

On 15/11/2013 15:07, Joel Goldstick wrote:






Cool, someone here is older than me!  I came in with the 8080, and I
remember split octal, but sixes are something I missed out on.


The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9, pdp 8s had 
12 bits I think, then came the IBM 7094 which had 36 bits and finally the 
CDC 6000 & 7600 machines with 60 bits; someone must have liked 6's

-mumbling-ly yrs-
Robin Becker
--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Roy Smith
On Nov 15, 2013, at 10:18 AM, Robin Becker wrote:

 The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9

I don't know about the 15, but the 10 had 36 bit words (18-bit halfwords).  One 
common character packing was 5 7-bit characters per 36 bit word (with the sign 
bit left over).

Anybody remember RAD-50?  It let you represent a 6-character filename (plus a 
3-character extension) in a 16 bit word.  RT-11 used it, not sure if it showed 
up anywhere else.

---
Roy Smith
r...@panix.com

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Robin Becker

.


Dealing with bytes and Unicode is complicated, and the 2-3 transition is not 
easy, but let's please not spread the misunderstanding that somehow the Flexible 
String Representation is at fault.  However you store Unicode code points, they 
are different than bytes, and it is complex having to deal with both.  You can't 
somehow make the dichotomy go away, you can only choose where you want to think 
about it.

--Ned.

...
I don't think that's what I said; the flexible representation is just an added 
complexity that has come about because of the wish to store strings in a compact 
way. The requirement for such complexity is the unicode type itself (especially 
the storage requirements) which necessitated some remedial action.


There's no point in fighting the change to using unicode. The type wasn't 
required for any technical reason as other languages didn't go this route and 
are reasonably ok, but there's no doubt the change made things more difficult.

--
Robin Becker
--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Antoon Pardon
On 15-11-13 16:39, Robin Becker wrote:
 .

 Dealing with bytes and Unicode is complicated, and the 2-3 transition
 is not easy, but let's please not spread the misunderstanding that
 somehow the Flexible String Representation is at fault.  However you
 store Unicode code points, they are different than bytes, and it is
 complex having to deal with both.  You can't somehow make the
 dichotomy go away, you can only choose where you want to think about it.

 --Ned.
 ...
 I don't think that's what I said; the flexible representation is just an
 added complexity ...

No it is not, at least not for python programmers. (It of course is for
the python implementors). The python programmer doesn't have to care
about the flexible representation, just as the python programmer doesn't
have to care about the internal representation of (long) integers. It
is an implementation detail that is mostly ignorable.

-- 
Antoon Pardon

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Chris Angelico
On Sat, Nov 16, 2013 at 2:39 AM, Robin Becker ro...@reportlab.com wrote:
 Dealing with bytes and Unicode is complicated, and the 2-3 transition is
 not easy, but let's please not spread the misunderstanding that somehow the
 Flexible String Representation is at fault.  However you store Unicode code
 points, they are different than bytes, and it is complex having to deal with
 both.  You can't somehow make the dichotomy go away, you can only choose
 where you want to think about it.

 --Ned.

 ...
 I don't think that's what I said; the flexible representation is just an
 added complexity that has come about because of the wish to store strings in
 a compact way. The requirement for such complexity is the unicode type
 itself (especially the storage requirements) which necessitated some
 remedial action.

 There's no point in fighting the change to using unicode. The type wasn't
 required for any technical reason as other languages didn't go this route
 and are reasonably ok, but there's no doubt the change made things more
 difficult.

There's no perceptible difference between a 3.2 wide build and the 3.3
flexible representation. (Differences with narrow builds are bugs, and
have now been fixed.) As far as your script's concerned, Python 3.3
always stores strings in UTF-32, four bytes per character. It just
happens to be way more efficient on memory, most of the time.
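A rough illustration of that memory behaviour (sizes are approximate and vary 
by build; the per-character width grows only when the string actually contains 
wider code points):

import sys
print(sys.getsizeof('x' * 100))           # ~149 bytes: 1 byte per char
print(sys.getsizeof('\u1234' * 100))      # ~274 bytes: 2 bytes per char
print(sys.getsizeof('\U00012345' * 100))  # ~476 bytes: 4 bytes per char

Each string behaves identically from Python code; only the footprint differs.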

Other languages _have_ gone for at least some sort of Unicode support.
Unfortunately quite a few have done a half-way job and use UTF-16 as
their internal representation. That means there's no difference
between U+0012, U+0123, and U+1234, but U+12345 suddenly gets handled
differently. ECMAScript actually specifies the perverse behaviour of
treating codepoints above U+FFFF as two elements in a string, because it's
just too costly to change.

There are a small number of languages that guarantee correct Unicode
handling. I believe bash scripts get this right (though I haven't
tested; string manipulation in bash isn't nearly as rich as a proper
text parsing language, so I don't dig into it much); Pike is a very
Python-like language, and PEP 393 made Python even more Pike-like,
because Pike's string has been variable width for as long as I've
known it. A handful of other languages also guarantee UTF-32
semantics. All of them are really easy to work with; instead of
writing your code and then going "Oh, I wonder what'll happen if I
give this thing weird characters?", you just write your code, safe in
the knowledge that there is no such thing as a "weird character"
(except for a few in the ASCII set... you may find that code breaks if
given a newline in the middle of something, or maybe the slash
confuses you).

Definitely don't fight the change to Unicode, because it's not a
change at all... it's just fixing what was buggy. You already had a
difference between bytes and characters, you just thought you could
ignore it.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread William Ray Wing
On Nov 15, 2013, at 10:18 AM, Robin Becker ro...@reportlab.com wrote:

 On 15/11/2013 15:07, Joel Goldstick wrote:
 
 
 
 
 
 Cool, someone here is older than me!  I came in with the 8080, and I
 remember split octal, but sixes are something I missed out on.
 
 The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9, pdp 8s 
 had 12 bits I think, then came the IBM 7094 which had 36 bits and finally the 
 CDC 6000 & 7600 machines with 60 bits; someone must have liked 6's
 -mumbling-ly yrs-
 Robin Becker
 -- 
 https://mail.python.org/mailman/listinfo/python-list

Yes, the PDP-8s, LINC-8s, and PDP-12s were all 12-bit computers.  However the 
LINC-8 operated with word-pairs (instruction in one location followed by 
address to be operated on in the next) so it was effectively a 24-bit computer 
and the PDP-12 was able to execute BOTH PDP-8 and LINC-8 instructions (it added 
one extra instruction to each set that flipped the mode).

First assembly language program I ever wrote was on a PDP-12.  (If there is an 
emoticon for a face with a gray beard, I don't know it.)

-Bill
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Gene Heskett
On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:

 On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker ro...@reportlab.com 
wrote:
  ...
  
  became popular.
  
  Really? you cried and laughed over 7 vs. 8 bits?  That's lovely (?).
  ;).  That eighth bit sure was less confusing than codepoint
  translations
  
  no we had 6 bits in 60 bit words as I recall; extracting the nth
  character involved division by 6; smart people did tricks with
  inverted multiplications etc etc  :(
  --
 
 Cool, someone here is older than me!  I came in with the 8080, and I
 remember split octal, but sixes are something I missed out on.

Ok, if you are feeling old & decrepit, how's this for a birthday: 10/04/34, 
I came into micro computers about RCA 1802 time.  Wrote a program for the 
1802 without an assembler, for tape editing in '78 at KRCR-TV in Redding 
CA, that was still in use in '94, but never really wrote assembly code 
until the 6809 was out in the Radio Shack Color Computers.  os9 on the 
coco's was the best teacher about the unix way of doing things there ever 
was.  So I tell folks these days that I am 39, with 40 years experience at 
being 39. ;-)

  Robin Becker


Cheers, Gene
-- 
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)

Counting in binary is just like counting in decimal -- if you are all 
thumbs.
-- Glaser and Way
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
 law-abiding citizens.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Zero Piraeus
:

On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote:
 Anybody remember RAD-50?  It let you represent a 6-character filename
 (plus a 3-character extension) in a 16 bit word.  RT-11 used it, not
 sure if it showed up anywhere else.

Presumably "16" is a typo, but I just had a moderate amount of fun
envisaging how that might work: if the characters were restricted to
vowels, then 5**6 < 2**14, giving a couple of bits left over for a
choice of four preset three-character extensions.

I can't say that AEIOUA.EX1 looks particularly appealing, though ...
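The arithmetic checks out:

>>> 5**6, 2**14   # six vowels need 14 bits; 2 of 16 remain for the extension
(15625, 16384)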

 -[]z.

-- 
Zero Piraeus: pollice verso
http://etiol.net/pubkey.asc
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Chris Angelico
On Sat, Nov 16, 2013 at 4:06 AM, Zero Piraeus z...@etiol.net wrote:
 :

 On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote:
 Anybody remember RAD-50?  It let you represent a 6-character filename
 (plus a 3-character extension) in a 16 bit word.  RT-11 used it, not
 sure if it showed up anywhere else.

 Presumably "16" is a typo, but I just had a moderate amount of fun
 envisaging how that might work: if the characters were restricted to
 vowels, then 5**6 < 2**14, giving a couple of bits left over for a
 choice of four preset three-character extensions.

 I can't say that AEIOUA.EX1 looks particularly appealing, though ...

Looks like it might be this scheme:

https://en.wikipedia.org/wiki/DEC_Radix-50

36-bit word for a 6-char filename, but there was also a 16-bit
variant. I do like that filename scheme you describe, though it would
tend to produce names that would suit virulent diseases.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Steven D'Aprano
On Fri, 15 Nov 2013 14:43:17 +, Robin Becker wrote:

 Things went wrong when utf8 was not adopted as the standard encoding
 thus requiring two string types, it would have been easier to have a len
 function to count bytes as before and a glyphlen to count glyphs. Now as
 I understand it we have a complicated mess under the hood for unicode
 objects so they have a variable representation to approximate an 8 bit
 representation when suitable etc etc etc.

No no no! Glyphs are *pictures*, you know the little blocks of pixels 
that you see on your monitor or printed on a page. Before you can count 
glyphs in a string, you need to know which typeface (font) is being 
used, since fonts generally lack glyphs for some code points.

[Aside: there's another complication. Some fonts define alternate glyphs 
for the same code point, so that the design of (say) the letter "a" may 
vary within the one string according to whatever typographical rules the 
font supports and the application calls. So the question is, when you 
count glyphs, should you count "a" and "alternate a" as a single glyph 
or two?]

You don't actually mean "count glyphs", you mean "count code points" 
(think characters, only with some complications that aren't important for 
the purposes of this discussion).

UTF-8 is utterly unsuited for in-memory storage of text strings, I don't 
care how many languages (Go, Haskell?) make that mistake. When you're 
dealing with text strings, the fundamental unit is the character, not the 
byte. Why do you care how many bytes a text string has? If you really 
need to know how much memory an object is using, that's where you use 
sys.getsizeof(), not len().

We don't say len({42: None}) to discover that the dict requires 136 
bytes, why would you use len("heåvy") to learn that it uses 23 bytes?

UTF-8 is variable width encoding, which means it's *rubbish* for the in-
memory representation of strings. Counting characters is slow. Slicing is 
slow. If you have mutable strings, deleting or inserting characters is 
slow. Every operation has to effectively start at the beginning of the 
string and count forward, lest it split bytes in the middle of a UTF 
unit. Or worse, the language doesn't give you any protection from this at 
all, so rather than slow string routines you have unsafe string routines, 
and it's your responsibility to detect UTF boundaries yourself. 

In case you aren't familiar with what I'm talking about, here's an 
example using Python 3.2, starting with a Unicode string and treating it 
as UTF-8 bytes:

py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y


"Ã¥"? It didn't take long to get moji-bake in our output, and all I did 
was print the (byte) string one character at a time. It gets worse: we 
can easily end up with invalid UTF-8:

py> a, b = s[:len(s)//2], s[len(s)//2:]  # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2: 
unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: 
invalid start byte


No, UTF-8 is okay for writing to files, but it's not suitable for text 
strings. The in-memory representation of text strings should be constant 
width, based on characters not bytes, and should prevent the caller from 
accidentally ending up with moji-bake or invalid strings.


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Chris Angelico
On Sat, Nov 16, 2013 at 4:10 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 No, UTF-8 is okay for writing to files, but it's not suitable for text
 strings.

Correction: It's _great_ for writing to files (and other fundamentally
byte-oriented streams, like network connections). Does a superb job as
the default encoding for all sorts of situations. But, as you say, it
sucks if you want to find the Nth character.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Serhiy Storchaka

On 15.11.13 17:32, Roy Smith wrote:

Anybody remember RAD-50?  It let you represent a 6-character filename
(plus a 3-character extension) in a 16 bit word.  RT-11 used it, not
sure if it showed up anywhere else.


In three 16-bit words.
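For the curious, a sketch of the 16-bit packing (character set per the DEC 
Radix-50 page linked upthread; the helper name here is mine):

RAD50 = ' ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789'   # 40 characters

def rad50_word(chars):
    # pack exactly three RAD-50 characters into one 16-bit word:
    # 40**3 == 64000 <= 65536, hence three words for a 6+3 filename
    a, b, c = (RAD50.index(ch) for ch in chars.upper().ljust(3))
    return (a * 40 + b) * 40 + c

print(hex(rad50_word('RT1')))   # 0x73bf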


--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Cousin Stanley

 
 We don't say len({42: None}) to discover 
 that the dict requires 136 bytes, 
 why would you use len("heåvy") 
 to learn that it uses 23 bytes ?
 

#!/usr/bin/env python
# -*- coding: utf-8 -*-


'''
illustrate the difference in length of python objects
and the size of their system storage
'''

import sys

s = 'heåvy'

d = { 42 :  None }

print
print '   s :  %s' % s
print 'len( s ) :  %d' % len( s )
print '  sys.getsizeof( s ) :  %s ' % sys.getsizeof( s )
print
print
print '   d : ' , d
print 'len( d ) :  %d' % len( d )
print '  sys.getsizeof( d ) :  %d ' % sys.getsizeof( d )


-- 
Stanley C. Kitching
Human Being
Phoenix, Arizona
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Neil Cerutti
On 2013-11-15, Chris Angelico ros...@gmail.com wrote:
 Other languages _have_ gone for at least some sort of Unicode
 support. Unfortunately quite a few have done a half-way job and
 use UTF-16 as their internal representation. That means there's
 no difference between U+0012, U+0123, and U+1234, but U+12345
 suddenly gets handled differently. ECMAScript actually
 specifies the perverse behaviour of treating codepoints U+
 as two elements in a string, because it's just too costly to
 change.

The unicode support I'm learning in Go is, "Everything is utf-8,
right? RIGHT?!?" It also has the interesting behavior that
indexing strings retrieves bytes, while iterating over them
results in a sequence of runes.

It comes with support for no encodings save utf-8 (natively) and
utf-16 (if you work at it). Is that really enough?

-- 
Neil Cerutti
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Mark Lawrence

On 15/11/2013 16:36, Gene Heskett wrote:

On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:


On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker ro...@reportlab.com

wrote:

...


became popular.


Really? you cried and laughed over 7 vs. 8 bits?  That's lovely (?).
;).  That eighth bit sure was less confusing than codepoint
translations


no we had 6 bits in 60 bit words as I recall; extracting the nth
character involved division by 6; smart people did tricks with
inverted multiplications etc etc  :(
--


Cool, someone here is older than me!  I came in with the 8080, and I
remember split octal, but sixes are something I missed out on.


Ok, if you are feeling old & decrepit, how's this for a birthday: 10/04/34,
I came into micro computers about RCA 1802 time.  Wrote a program for the
1802 without an assembler, for tape editing in '78 at KRCR-TV in Redding
CA, that was still in use in '94, but never really wrote assembly code
until the 6809 was out in the Radio Shack Color Computers.  os9 on the
coco's was the best teacher about the unix way of doing things there ever
was.  So I tell folks these days that I am 39, with 40 years experience at
being 39. ;-)


Robin Becker



Cheers, Gene



I also used the RCA 1802, but did you use the Ferranti F100L?  Rationale 
for the use of both: mid/late 70s they were the only processors of their 
respective type with military approvals.


Can't remember how we coded on the F100L, but the 1802 work was done on 
the Texas Instruments Silent 700, copying from one cassette tape to 
another.  Set the controls wrong when copying and whoops, you've just 
overwritten the work you've just done.  We could have had a decent 
development environment but it was on a UK MOD cost plus project, so the 
more inefficiently you worked, the more profit your employer made.


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Unicode stdin/stdout (was: Re: python 3.3 repr)

2013-11-15 Thread random832
Of course, the real solution to this issue is to replace sys.stdout on
windows with an object that can handle Unicode directly with the
WriteConsoleW function - the problem there is that it will break code
that expects to be able to use sys.stdout.buffer for binary I/O. I also
wasn't able to get the analogous stdin replacement class to work with
input() in my attempts.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Gene Heskett
On Friday 15 November 2013 13:52:40 Mark Lawrence did opine:

 On 15/11/2013 16:36, Gene Heskett wrote:
  On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:
  On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker ro...@reportlab.com
  
  wrote:
  ...
  
  became popular.
  
  Really? you cried and laughed over 7 vs. 8 bits?  That's lovely
  (?). ;).  That eighth bit sure was less confusing than codepoint
  translations
  
  no we had 6 bits in 60 bit words as I recall; extracting the nth
  character involved division by 6; smart people did tricks with
  inverted multiplications etc etc  :(
  --
  
  Cool, someone here is older than me!  I came in with the 8080, and I
  remember split octal, but sixes are something I missed out on.
  
  Ok, if you are feeling old & decrepit, how's this for a birthday:
  10/04/34, I came into micro computers about RCA 1802 time.  Wrote a
  program for the 1802 without an assembler, for tape editing in '78 at
  KRCR-TV in Redding CA, that was still in use in '94, but never really
  wrote assembly code until the 6809 was out in the Radio Shack Color
  Computers.  os9 on the coco's was the best teacher about the unix way
  of doing things there ever was.  So I tell folks these days that I am
  39, with 40 years experience at being 39. ;-)
  
  Robin Becker
  
  Cheers, Gene
 
 I also used the RCA 1802, but did you use the Ferranti F100L?  Rationale
 for the use of both, mid/late 70s they were the only processors of their
 respective type with military approvals.
 
 Can't remember how we coded on the F100L, but the 1802 work was done on
 the Texas Instruments Silent 700, copying from one cassette tape to
 another.  Set the controls wrong when copying and whoops, you've just
 overwritten the work you've just done.  We could have had a decent
 development environment but it was on a UK MOD cost plus project, so the
 more inefficiently you worked, the more profit your employer made.

BTDT but in 1959-60 era.  Testing the ullage pressure regulators for the 
early birds, including some that gave John Glenn his first ride or 2.  I 
don't recall the brand of paper tape recorders, but they used 12at7's & 
12au7's by the grocery sack full.  One or more got noisy & me being the 
budding C.E.T. that I now am, of course ran down the bad ones and requested 
new ones.  But you had to turn in the old ones, which Stellardyne Labs 
simply recycled back to you the next time you needed a few.  Hopeless 
management IMO, but that's cost plus for you.

At 10k$ a truckload for helium back then, each test lost about $3k worth of 
helium because the recycle catcher tank was so thin walled.  And the 6 
stage cardox re-compressor was so leaky, occasionally blowing up a pipe out 
of the last stage that put about 7800 lbs back in the monel tanks.

I considered that a huge waste compared to the cost of a 12au7, then about 
$1.35, and raised hell, so I got fired.  They simply did not care that a 
perfectly good regulator was being abused to death when it took 10 or more 
test runs to get one good recording for the certification. At those 
operating pressures, the valve faces erode just like the seats in your 
shower faucets do in 20 years.  Ten such runs and you may as well bin it, 
but they didn't.

I am amazed that as many of those birds worked as did.  Of course if it 
wasn't manned, they didn't talk about the roman candles on the launch pads. 
I heard one story that they had to regrade one pad's real estate at 
Vandenburg & start all over; seems some ID10T had left the cable to the 
explosive bolts hanging on the cable tower.  Ooops, and there's no off 
switch in many of those once the umbilical has been dropped.

Cheers, Gene
-- 
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)

Tehee quod she, and clapte the wyndow to.
-- Geoffrey Chaucer
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
 law-abiding citizens.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Terry Reedy

On 11/15/2013 6:28 AM, Robin Becker wrote:

I'm trying to understand what's going on with this simple program

if __name__=='__main__':
     print("repr=%s" % repr(u'\xc1'))
     print("%%r=%r" % u'\xc1')

On my windows XP box this fails miserably if run directly at a terminal

C:\tmp> \Python33\python.exe bang.py
Traceback (most recent call last):
   File "bang.py", line 2, in <module>
     print("repr=%s" % repr(u'\xc1'))
   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in
position 6: character maps to <undefined>

If I run the program redirected into a file then no error occurs and the
result looks like this

C:\tmp>cat fff
repr='┴'
%r='┴'

and if I run it into a pipe it works as though into a file.

It seems that repr thinks it can render u'\xc1' directly which is a
problem since print then seems to want to convert that to cp437 if
directed into a terminal.

I find the idea that print knows what it's printing to a bit dangerous,


print() just calls file.write(s), where file defaults to sys.stdout, for 
each string fragment it creates. write(s) *has* to encode s to bytes 
according to some encoding, and it uses the encoding associated with the 
file when it was opened.
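A small demonstration of that coupling, plus one possible escape hatch (a 
sketch: the rewrap keeps the console encoding but replaces unencodable 
characters instead of raising):

import io, sys
print(sys.stdout.encoding)   # e.g. 'cp437' at a Windows console

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding=sys.stdout.encoding,
                              errors='backslashreplace')
print(u'\xc1')   # prints \xc1 wherever the console encoding lacks Á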



but it's the repr behaviour that strikes me as bad.

What is responsible for defining the repr function's 'printable'

 so that repr would give me say an Ascii rendering?

That is not repr's job. Perhaps you are looking for
>>> repr(u'\xc1')
'Á'
>>> ascii(u'\xc1')
'\\xc1'
The above is with Idle on Win7. It is *much* better than the 
intentionally crippled console for working with the BMP subset of unicode.


--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list


Re: python 3.3 repr

2013-11-15 Thread Steven D'Aprano
On Fri, 15 Nov 2013 17:47:01 +, Neil Cerutti wrote:

 The unicode support I'm learning in Go is, "Everything is utf-8, right?
 RIGHT?!?" It also has the interesting behavior that indexing strings
 retrieves bytes, while iterating over them results in a sequence of
 runes.
 
 It comes with support for no encodings save utf-8 (natively) and utf-16
 (if you work at it). Is that really enough?

Only if you never need to handle data created by other applications.



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list