python 3.3 repr
I'm trying to understand what's going on with this simple program

    if __name__=='__main__':
        print('repr=%s' % repr(u'\xc1'))
        print('%%r=%r' % u'\xc1')

On my windows XP box this fails miserably if run directly at a terminal

    C:\tmp> \Python33\python.exe bang.py
    Traceback (most recent call last):
      File "bang.py", line 2, in <module>
        print('repr=%s' % repr(u'\xc1'))
      File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6: character maps to <undefined>

If I run the program redirected into a file then no error occurs and the result looks like this

    C:\tmp> cat fff
    repr='┴'
    %r='┴'

and if I run it into a pipe it works as though into a file. It seems that repr thinks it can render u'\xc1' directly, which is a problem, since print then seems to want to convert that to cp437 if directed into a terminal. I find the idea that print knows what it's printing to a bit dangerous, but it's the repr behaviour that strikes me as bad. What is responsible for defining the repr function's 'printable', so that repr would give me, say, an ASCII rendering?

-confused-ly yrs-
Robin Becker
-- 
https://mail.python.org/mailman/listinfo/python-list
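A minimal sketch (not from the thread) that reproduces the difference being described on any Python 3 install; the cp437 choice simply mirrors the Windows console codepage in the traceback above:

```python
# print() encodes the already-built repr() text with the stream's codec.
# repr() keeps '\xc1' as a real character, so a cp437 console chokes on
# it; ascii() escapes it and is always safe to print.
text = repr('\xc1')          # "'Á'" - contains the raw character
safe = ascii('\xc1')         # "'\\xc1'" - pure-ASCII escape

try:
    text.encode('cp437')     # roughly what print() does on that console
except UnicodeEncodeError as e:
    print('a cp437 console would fail:', e)

print(safe)                  # never raises, whatever the console codec
```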
Re: python 3.3 repr
On Friday, November 15, 2013 6:28:15 AM UTC-5, Robin Becker wrote:

    I'm trying to understand what's going on with this simple program

        if __name__=='__main__':
            print('repr=%s' % repr(u'\xc1'))
            print('%%r=%r' % u'\xc1')

    On my windows XP box this fails miserably if run directly at a terminal

        C:\tmp> \Python33\python.exe bang.py
        Traceback (most recent call last):
          File "bang.py", line 2, in <module>
            print('repr=%s' % repr(u'\xc1'))
          File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
            return codecs.charmap_encode(input,self.errors,encoding_map)[0]
        UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6: character maps to <undefined>

    If I run the program redirected into a file then no error occurs and the result looks like this

        C:\tmp> cat fff
        repr='┴'
        %r='┴'

    and if I run it into a pipe it works as though into a file. It seems that repr thinks it can render u'\xc1' directly, which is a problem, since print then seems to want to convert that to cp437 if directed into a terminal. I find the idea that print knows what it's printing to a bit dangerous, but it's the repr behaviour that strikes me as bad. What is responsible for defining the repr function's 'printable', so that repr would give me, say, an ASCII rendering?

In Python 3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ASCII representation, there is the new builtin ascii(), and a corresponding %a format string.

--Ned.
-- 
https://mail.python.org/mailman/listinfo/python-list
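Ned's suggestion is easy to check in any Python 3 interpreter; a minimal sketch:

```python
# repr() and %r preserve non-ASCII characters; ascii() and the %a
# format escape them, matching Python 2's repr() behaviour.
s = '\xc1'                        # LATIN CAPITAL LETTER A WITH ACUTE
print('repr  = %s' % repr(s))     # repr  = 'Á'
print('%%r    = %r' % s)          # %r    = 'Á'
print('ascii = %s' % ascii(s))    # ascii = '\xc1'
print('%%a    = %a' % s)          # %a    = '\xc1'
```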
Re: python 3.3 repr
On 15/11/2013 11:38, Ned Batchelder wrote:

    .. In Python 3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ASCII representation, there is the new builtin ascii(), and a corresponding %a format string. --Ned.

thanks for this; it doesn't make the split across python 2 - 3 any easier.
-- 
Robin Becker
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Friday, November 15, 2013 7:16:52 AM UTC-5, Robin Becker wrote:

    On 15/11/2013 11:38, Ned Batchelder wrote: .. In Python 3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ASCII representation, there is the new builtin ascii(), and a corresponding %a format string. --Ned.

    thanks for this; it doesn't make the split across python 2 - 3 any easier. -- Robin Becker

No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

    try:
        repr = ascii
    except NameError:
        pass

and then use repr throughout.

--Ned.
-- 
https://mail.python.org/mailman/listinfo/python-list
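A sketch of how Ned's shim behaves: on Python 3, where ascii() exists, the module-level name repr is rebound; on Python 2 the NameError leaves the builtin untouched:

```python
# On Python 3 this rebinds repr to ascii, so repr escapes non-ASCII;
# on Python 2, ascii() doesn't exist, the NameError fires, and the
# builtin repr (which already escapes) is kept.
try:
    repr = ascii
except NameError:
    pass

print(repr(u'\xc1'))   # escaped on both 2 and 3 (2 adds a u prefix)
```

As Robin notes later in the thread, this does not change what %r produces, since %-formatting does not look up the repr name.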
Re: python 3.3 repr
In article b6db8982-feac-4036-8ec4-2dc720d41...@googlegroups.com, Ned Batchelder n...@nedbatchelder.com wrote:

    In Python 3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ASCII representation, there is the new builtin ascii(), and a corresponding %a format string.

I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard. The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:

    MAIN()
    \(
    PRINTF("HELLO, ASCII WORLD");
    \)

because ASR-33's didn't have curly braces (or lower case). Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
-- 
Roy Smith
r...@panix.com
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On 15/11/2013 13:54, Ned Batchelder wrote:

    . No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

        try:
            repr = ascii
        except NameError:
            pass

yes I tried that, but it doesn't affect %r, which is inlined in unicodeobject.c. For me it seems easier to fix windows to use something like a standard encoding of utf8, i.e. cp65001, but that's quite hard to do globally. It seems sitecustomize is too late to set os.environ['PYTHONIOENCODING']; perhaps I can stuff that into one of the global environment vars and have it work for all python invocations.
-- 
Robin Becker
-- 
https://mail.python.org/mailman/listinfo/python-list
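Why sitecustomize is too late can be demonstrated from within Python itself; a sketch (the utf-8 value is just an example, any codec name works):

```python
import os
import subprocess
import sys

# Setting PYTHONIOENCODING from inside a running interpreter (which is
# all sitecustomize can do) has no effect: sys.stdout was created with
# its encoding at startup.
os.environ['PYTHONIOENCODING'] = 'utf-8'
print(sys.stdout.encoding)     # still whatever it was at startup

# It works only if the variable is in the environment before Python
# launches, e.g. set globally or by the parent process:
out = subprocess.check_output(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=dict(os.environ, PYTHONIOENCODING='utf-8'))
print(out.decode().strip())    # utf-8
```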
Re: python 3.3 repr
15.11.13 15:54, Ned Batchelder wrote:

    No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

        try:
            repr = ascii
        except NameError:
            pass

    and then use repr throughout.

Or rather

    try:
        ascii
    except NameError:
        ascii = repr

and then use ascii throughout.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
..

    I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard. The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

unfortunately the word 'printable' got into the definition of repr; it's clear that printability is not the same as unicode, at least as far as the print function is concerned. In my opinion it would have been better to leave the old behaviour, as that would have eased the compatibility. The python gods don't count that sort of thing as important enough, so we get the mess that is the python2/3 split. ReportLab has to do both so it's a real issue; in addition, swapping the str/unicode pair to bytes/str doesn't help one's mental models either :(

Things went wrong when utf8 was not adopted as the standard encoding, thus requiring two string types; it would have been easier to have a len function to count bytes as before and a glyphlen to count glyphs. Now as I understand it we have a complicated mess under the hood for unicode objects, so they have a variable representation to approximate an 8 bit representation when suitable etc etc etc.

    Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:

        MAIN()
        \(
        PRINTF("HELLO, ASCII WORLD");
        \)

    because ASR-33's didn't have curly braces (or lower case). Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.

.
I can certainly remember those days, how we cried and laughed when 8 bits became popular. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
    Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:

        MAIN()
        \(
        PRINTF("HELLO, ASCII WORLD");
        \)

    because ASR-33's didn't have curly braces (or lower case). Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.

    I can certainly remember those days, how we cried and laughed when 8 bits became popular.

Really? You cried and laughed over 7 vs. 8 bits? That's lovely (?). ;). That eighth bit sure was less confusing than codepoint translations
-- 
Joel Goldstick
http://joelgoldstick.com
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On 15/11/2013 14:40, Serhiy Storchaka wrote:

    .. and then use repr throughout. Or rather

        try:
            ascii
        except NameError:
            ascii = repr

    and then use ascii throughout.

apparently you can import ascii from future_builtins, and the print() function is available as from __future__ import print_function; nothing fixes all those %r formats to be %a though :(
-- 
Robin Becker
-- 
https://mail.python.org/mailman/listinfo/python-list
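A sketch of the two imports Robin mentions, written so it runs unchanged on both 2 and 3 (on Python 3 the future_builtins import simply fails and the builtin ascii() is used):

```python
from __future__ import print_function   # Python 3 print() on Python 2

try:
    from future_builtins import ascii   # Python 2: Python-3-style ascii()
except ImportError:
    pass                                # Python 3: ascii() is a builtin

# %a only exists on Python 3, so portable code formats explicitly:
print('value = %s' % ascii(u'\xc1'))    # value = '\xc1'
```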
Re: python 3.3 repr
... became popular. Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?). ;). That eighth bit sure was less confusing than codepoint translations no we had 6 bits in 60 bit words as I recall; extracting the nth character involved division by 6; smart people did tricks with inverted multiplications etc etc :( -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker ro...@reportlab.com wrote: ... became popular. Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?). ;). That eighth bit sure was less confusing than codepoint translations no we had 6 bits in 60 bit words as I recall; extracting the nth character involved division by 6; smart people did tricks with inverted multiplications etc etc :( -- Cool, someone here is older than me! I came in with the 8080, and I remember split octal, but sixes are something I missed out on. Robin Becker -- Joel Goldstick http://joelgoldstick.com -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Friday, November 15, 2013 9:43:17 AM UTC-5, Robin Becker wrote: Things went wrong when utf8 was not adopted as the standard encoding thus requiring two string types, it would have been easier to have a len function to count bytes as before and a glyphlen to count glyphs. Now as I understand it we have a complicated mess under the hood for unicode objects so they have a variable representation to approximate an 8 bit representation when suitable etc etc etc. Dealing with bytes and Unicode is complicated, and the 2-3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it. --Ned. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Sat, Nov 16, 2013 at 1:43 AM, Robin Becker ro...@reportlab.com wrote:

    .. I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard. The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

    unfortunately the word 'printable' got into the definition of repr; it's clear that printability is not the same as unicode at least as far as the print function is concerned. In my opinion it would have been better to leave the old behaviour as that would have eased the compatibility.

"Printable" means many different things in different contexts. In some contexts, the sequence '\x66\x75\x63\x6b' is considered unprintable, yet each of those characters is perfectly displayable in its natural form. Under IDLE, non-BMP characters can't be displayed (or at least, that's how it has been; I haven't checked current status on that one). On Windows, the console runs in codepage 437 by default (again, I may be wrong here), so anything not representable in that has to be escaped. My Linux box has its console set to full Unicode, everything working perfectly, so any non-control character can be printed. As far as Python's concerned, all of that is outside - something is printable if it's printable within Unicode, and the other hassles are matters of encoding. (Except the first one. I don't think there's an encoding that makes that one g-rated.)

    The python gods don't count that sort of thing as important enough so we get the mess that is the python2/3 split.
    ReportLab has to do both so it's a real issue; in addition swapping the str/unicode pair to bytes/str doesn't help one's mental models either :(

That's fixing, in effect, a long-standing bug - of a sort. The name str needs to be applied to the "most normal" string type. As of Python 3, that's a Unicode string, which is as it should be. In Python 2, it was the ASCII/bytes string, which still fit the description of "most normal string type", but that means that Python 2 programs are Unicode-unaware by default, which is a flaw. Hence the Py3 fix.

    Things went wrong when utf8 was not adopted as the standard encoding thus requiring two string types, it would have been easier to have a len function to count bytes as before and a glyphlen to count glyphs. Now as I understand it we have a complicated mess under the hood for unicode objects so they have a variable representation to approximate an 8 bit representation when suitable etc etc etc.

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

There are languages that do what you describe. It's very VERY easy to break stuff. What happens when you slice a string?

    >>> foo = "asdf"
    >>> foo[:2], foo[2:]
    ('as', 'df')
    >>> foo = "q\u1234zy"
    >>> foo[:2], foo[2:]
    ('qሴ', 'zy')

Looks good to me. I split a four-character string, I get two two-character strings. If that had been done in UTF-8, either I would need to know "don't split at that boundary, that's between bytes in a character", or else the indexing and slicing would have to be done by counting characters from the beginning of the string - an O(n) operation, rather than O(1) pointer arithmetic, not to mention that it'll blow your CPU cache (touching every part of a potentially-long string) just to find the position. The only reliable way to manage things is to work with true Unicode.
You can completely ignore the internal CPython representation; what matters is that in Python (any implementation, as long as it conforms with version 3.3 or later) lets you index Unicode codepoints out of a Unicode string, without differentiating between those that happen to be ASCII, those that fit in a single byte, those that fit in two bytes, and those that are flagged RTL, because none of those considerations makes any difference to you. It takes some getting your head around, but it's worth it - same as using git instead of a Windows shared drive. (I'm still trying to push my family to think git.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
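The O(1)-slicing point above can be contrasted with byte-oriented slicing directly; a sketch (the mid-character boundary is just where "q\u1234zy" happens to split):

```python
s = "q\u1234zy"                    # 4 codepoints; U+1234 needs 3 UTF-8 bytes
assert (s[:2], s[2:]) == ("q\u1234", "zy")   # codepoint slicing: always clean

b = s.encode("utf-8")              # 6 bytes: 1 + 3 + 1 + 1
left, right = b[:3], b[3:]         # naive byte split lands mid-character
try:
    left.decode("utf-8")
except UnicodeDecodeError as e:
    print("split a character in half:", e)
```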
Re: python 3.3 repr
On 15/11/2013 15:07, Joel Goldstick wrote: Cool, someone here is older than me! I came in with the 8080, and I remember split octal, but sixes are something I missed out on. The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9, pdp 8s had 12 bits I think, then came the IBM 7094 which had 36 bits and finally the CDC6000 7600 machines with 60 bits, some one must have liked 6's -mumbling-ly yrs- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Nov 15, 2013, at 10:18 AM, Robin Becker wrote: The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9 I don't know about the 15, but the 10 had 36 bit words (18-bit halfwords). One common character packing was 5 7-bit characters per 36 bit word (with the sign bit left over). Anybody remember RAD-50? It let you represent a 6-character filename (plus a 3-character extension) in a 16 bit word. RT-11 used it, not sure if it showed up anywhere else. --- Roy Smith r...@panix.com -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
. Dealing with bytes and Unicode is complicated, and the 2-3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it. --Ned. ... I don't think that's what I said; the flexible representation is just an added complexity that has come about because of the wish to store strings in a compact way. The requirement for such complexity is the unicode type itself (especially the storage requirements) which necessitated some remedial action. There's no point in fighting the change to using unicode. The type wasn't required for any technical reason as other languages didn't go this route and are reasonably ok, but there's no doubt the change made things more difficult. -- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On 15-11-13 16:39, Robin Becker wrote:

    . Dealing with bytes and Unicode is complicated, and the 2-3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it. --Ned. ... I don't think that's what I said; the flexible representation is just an added complexity ...

No it is not, at least not for python programmers. (It of course is for the python implementors.) The python programmer doesn't have to care about the flexible representation, just as the python programmer doesn't have to care about the internal representation of (long) integers. It is an implementation detail that is mostly ignorable.
-- 
Antoon Pardon
-- 
https://mail.python.org/mailman/listinfo/python-list
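Antoon's point that the flexible representation is invisible to the programmer can be illustrated on CPython 3.3+ (the memory-size comparison assumes CPython, since sys.getsizeof is implementation-specific):

```python
import sys

# Three strings of identical length with different internal widths
# under PEP 393 (1-byte, 1-byte, 2-byte storage). The str API behaves
# identically for all of them; only the memory footprint differs.
a, b, c = "abcd", "ab\xe9d", "ab\u1234d"
assert len(a) == len(b) == len(c) == 4
print([sys.getsizeof(x) for x in (a, b, c)])   # sizes differ, behaviour doesn't
```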
Re: python 3.3 repr
On Sat, Nov 16, 2013 at 2:39 AM, Robin Becker ro...@reportlab.com wrote: Dealing with bytes and Unicode is complicated, and the 2-3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it. --Ned. ... I don't think that's what I said; the flexible representation is just an added complexity that has come about because of the wish to store strings in a compact way. The requirement for such complexity is the unicode type itself (especially the storage requirements) which necessitated some remedial action. There's no point in fighting the change to using unicode. The type wasn't required for any technical reason as other languages didn't go this route and are reasonably ok, but there's no doubt the change made things more difficult. There's no perceptible difference between a 3.2 wide build and the 3.3 flexible representation. (Differences with narrow builds are bugs, and have now been fixed.) As far as your script's concerned, Python 3.3 always stores strings in UTF-32, four bytes per character. It just happens to be way more efficient on memory, most of the time. Other languages _have_ gone for at least some sort of Unicode support. Unfortunately quite a few have done a half-way job and use UTF-16 as their internal representation. That means there's no difference between U+0012, U+0123, and U+1234, but U+12345 suddenly gets handled differently. ECMAScript actually specifies the perverse behaviour of treating codepoints U+ as two elements in a string, because it's just too costly to change. There are a small number of languages that guarantee correct Unicode handling. 
I believe bash scripts get this right (though I haven't tested; string manipulation in bash isn't nearly as rich as a proper text parsing language, so I don't dig into it much); Pike is a very Python-like language, and PEP 393 made Python even more Pike-like, because Pike's string has been variable width for as long as I've known it. A handful of other languages also guarantee UTF-32 semantics. All of them are really easy to work with; instead of writing your code and then going Oh, I wonder what'll happen if I give this thing weird characters?, you just write your code, safe in the knowledge that there is no such thing as a weird character (except for a few in the ASCII set... you may find that code breaks if given a newline in the middle of something, or maybe the slash confuses you). Definitely don't fight the change to Unicode, because it's not a change at all... it's just fixing what was buggy. You already had a difference between bytes and characters, you just thought you could ignore it. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
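The UTF-16 half-way behaviour described above is easy to see from Python 3.3+, where strings are sequences of codepoints regardless of plane; a sketch:

```python
astral = "\U00012345"                  # a single codepoint outside the BMP
assert len(astral) == 1                # one element in Python 3.3+

# In a UTF-16-based language (e.g. ECMAScript) the same character is
# stored as a surrogate pair - two string "elements":
utf16_units = len(astral.encode("utf-16-le")) // 2
assert utf16_units == 2
```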
Re: python 3.3 repr
On Nov 15, 2013, at 10:18 AM, Robin Becker ro...@reportlab.com wrote: On 15/11/2013 15:07, Joel Goldstick wrote: Cool, someone here is older than me! I came in with the 8080, and I remember split octal, but sixes are something I missed out on. The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9, pdp 8s had 12 bits I think, then came the IBM 7094 which had 36 bits and finally the CDC6000 7600 machines with 60 bits, some one must have liked 6's -mumbling-ly yrs- Robin Becker -- https://mail.python.org/mailman/listinfo/python-list Yes, the PDP-8s, LINC-8s, and PDP-12s were all 12-bit computers. However the LINC-8 operated with word-pairs (instruction in one location followed by address to be operated on in the next) so it was effectively a 24-bit computer and the PDP-12 was able to execute BOTH PDP-8 and LINC-8 instructions (it added one extra instruction to each set that flipped the mode). First assembly language program I ever wrote was on a PDP-12. (If there is an emoticon for a face with a gray beard, I don't know it.) -Bill -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:

    On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker ro...@reportlab.com wrote: ... became popular. Really? You cried and laughed over 7 vs. 8 bits? That's lovely (?). ;). That eighth bit sure was less confusing than codepoint translations no we had 6 bits in 60 bit words as I recall; extracting the nth character involved division by 6; smart people did tricks with inverted multiplications etc etc :( -- Cool, someone here is older than me! I came in with the 8080, and I remember split octal, but sixes are something I missed out on.

Ok, if you are feeling old and decrepit, how's this for a birthday: 10/04/34. I came into micro computers about RCA 1802 time. Wrote a program for the 1802 without an assembler, for tape editing in '78 at KRCR-TV in Redding CA, that was still in use in '94, but never really wrote assembly code until the 6809 was out in the Radio Shack Color Computers. os9 on the coco's was the best teacher about the unix way of doing things there ever was. So I tell folks these days that I am 39, with 40 years experience at being 39. ;-)

Cheers, Gene
-- 
"There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author)
Counting in binary is just like counting in decimal -- if you are all thumbs. -- Glaser and Way
A pen in the hand of this president is far more dangerous than 200 million guns in the hands of law-abiding citizens.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote:

    Anybody remember RAD-50? It let you represent a 6-character filename (plus a 3-character extension) in a 16 bit word. RT-11 used it, not sure if it showed up anywhere else.

Presumably "16" is a typo, but I just had a moderate amount of fun envisaging how that might work: if the characters were restricted to vowels, then 5**6 < 2**14, giving a couple of bits left over for a choice of four preset three-character extensions. I can't say that AEIOUA.EX1 looks particularly appealing, though ...

-[]z.
-- 
Zero Piraeus: pollice verso
http://etiol.net/pubkey.asc
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Sat, Nov 16, 2013 at 4:06 AM, Zero Piraeus z...@etiol.net wrote: : On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote: Anybody remember RAD-50? It let you represent a 6-character filename (plus a 3-character extension) in a 16 bit word. RT-11 used it, not sure if it showed up anywhere else. Presumably 16 is a typo, but I just had a moderate amount of fun envisaging how that might work: if the characters were restricted to vowels, then 5**6 2**14, giving a couple of bits left over for a choice of four preset three-character extensions. I can't say that AEIOUA.EX1 looks particularly appealing, though ... Looks like it might be this scheme: https://en.wikipedia.org/wiki/DEC_Radix-50 36-bit word for a 6-char filename, but there was also a 16-bit variant. I do like that filename scheme you describe, though it would tend to produce names that would suit virulent diseases. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
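For the curious, a sketch of the PDP-11 Radix-50 packing under discussion; the 40-character table and its ordering are assumptions taken from the Wikipedia page linked above, so treat the details as illustrative:

```python
# Radix-50 (PDP-11 flavour): a 40-character set, 3 characters per
# 16-bit word, since 40**3 == 64000 < 2**16.  A 6.3 filename therefore
# needs three words: two for the name, one for the extension.
RAD50 = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"

def rad50_word(chars):
    """Pack up to three characters into one 16-bit word."""
    word = 0
    for ch in chars.upper().ljust(3):
        word = word * 40 + RAD50.index(ch)
    return word

def pack_filename(name, ext):
    """6-character name + 3-character extension -> three 16-bit words."""
    name = name.upper().ljust(6)
    return [rad50_word(name[:3]), rad50_word(name[3:]), rad50_word(ext)]

words = pack_filename("SWAP", "SYS")   # hypothetical RT-11-style name
assert all(w < 2**16 for w in words)
```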
Re: python 3.3 repr
On Fri, 15 Nov 2013 14:43:17 +0000, Robin Becker wrote:

    Things went wrong when utf8 was not adopted as the standard encoding thus requiring two string types, it would have been easier to have a len function to count bytes as before and a glyphlen to count glyphs. Now as I understand it we have a complicated mess under the hood for unicode objects so they have a variable representation to approximate an 8 bit representation when suitable etc etc etc.

No no no! Glyphs are *pictures*, you know the little blocks of pixels that you see on your monitor or printed on a page. Before you can count glyphs in a string, you need to know which typeface (font) is being used, since fonts generally lack glyphs for some code points.

[Aside: there's another complication. Some fonts define alternate glyphs for the same code point, so that the design of (say) the letter "a" may vary within the one string according to whatever typographical rules the font supports and the application calls. So the question is, when you count glyphs, should you count "a" and alternate "a" as a single glyph or two?]

You don't actually mean count glyphs, you mean counting code points (think "characters", only with some complications that aren't important for the purposes of this discussion).

UTF-8 is utterly unsuited for in-memory storage of text strings, I don't care how many languages (Go, Haskell?) make that mistake. When you're dealing with text strings, the fundamental unit is the character, not the byte. Why do you care how many bytes a text string has? If you really need to know how much memory an object is using, that's where you use sys.getsizeof(), not len(). We don't say len({42: None}) to discover that the dict requires 136 bytes, why would you use len("heåvy") to learn that it uses 23 bytes?

UTF-8 is a variable width encoding, which means it's *rubbish* for the in-memory representation of strings. Counting characters is slow. Slicing is slow. If you have mutable strings, deleting or inserting characters is slow. Every operation has to effectively start at the beginning of the string and count forward, lest it split bytes in the middle of a UTF unit. Or worse, the language doesn't give you any protection from this at all, so rather than slow string routines you have unsafe string routines, and it's your responsibility to detect UTF boundaries yourself. In case you aren't familiar with what I'm talking about, here's an example using Python 3.2, starting with a Unicode string and treating it as UTF-8 bytes:

    py> u = "heåvy"
    py> s = u.encode('utf-8')
    py> for c in s:
    ...     print(chr(c))
    ...
    h
    e
    Ã
    ¥
    v
    y

Ã¥? It didn't take long to get moji-bake in our output, and all I did was print the (byte) string one character at a time. It gets worse: we can easily end up with invalid UTF-8:

    py> a, b = s[:len(s)//2], s[len(s)//2:]  # split the string in half
    py> a.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2: unexpected end of data
    py> b.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

No, UTF-8 is okay for writing to files, but it's not suitable for text strings. The in-memory representation of text strings should be constant width, based on characters not bytes, and should prevent the caller from accidentally ending up with moji-bake or invalid strings.

-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On Sat, Nov 16, 2013 at 4:10 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: No, UTF-8 is okay for writing to files, but it's not suitable for text strings. Correction: It's _great_ for writing to files (and other fundamentally byte-oriented streams, like network connections). Does a superb job as the default encoding for all sorts of situations. But, as you say, it sucks if you want to find the Nth character. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
15.11.13 17:32, Roy Smith wrote:

    Anybody remember RAD-50? It let you represent a 6-character filename (plus a 3-character extension) in a 16 bit word. RT-11 used it, not sure if it showed up anywhere else.

In three 16-bit words.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
    We don't say len({42: None}) to discover that the dict requires 136 bytes, why would you use len("heåvy") to learn that it uses 23 bytes?

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    """Illustrate the difference in length of python objects
       and the size of their system storage."""

    import sys

    s = "heåvy"
    d = {42: None}

    print
    print '   s : %s' % s
    print 'len( s ) : %d' % len( s )
    print ' sys.getsizeof( s ) : %s' % sys.getsizeof( s )
    print
    print '   d : ', d
    print 'len( d ) : %d' % len( d )
    print ' sys.getsizeof( d ) : %d' % sys.getsizeof( d )

-- 
Stanley C. Kitching
Human Being
Phoenix, Arizona
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On 2013-11-15, Chris Angelico <ros...@gmail.com> wrote:
> Other languages _have_ gone for at least some sort of Unicode
> support. Unfortunately quite a few have done a half-way job and use
> UTF-16 as their internal representation. That means there's no
> difference between U+0012, U+0123, and U+1234, but U+12345 suddenly
> gets handled differently. ECMAScript actually specifies the perverse
> behaviour of treating codepoints above U+FFFF as two elements in a
> string, because it's just too costly to change.

The unicode support I'm learning in Go is, "Everything is utf-8, right? RIGHT?!?" It also has the interesting behavior that indexing strings retrieves bytes, while iterating over them results in a sequence of runes. It comes with support for no encodings save utf-8 (natively) and utf-16 (if you work at it). Is that really enough?

--
Neil Cerutti
-- https://mail.python.org/mailman/listinfo/python-list
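Python itself can make the UTF-16 split visible: a codepoint above U+FFFF is one character to Python 3.3, but two 16-bit code units once encoded as UTF-16, which is exactly what a UTF-16-based language sees as "two elements" of the string:

```python
astral = '\U00012345'       # a codepoint above U+FFFF

assert len(astral) == 1     # one character in Python 3.3+

# Encoded as UTF-16 the same character occupies two 16-bit code units
# (a surrogate pair) -- the "two elements" ECMAScript exposes.
assert len(astral.encode('utf-16-le')) // 2 == 2

bmp = '\u1234'
assert len(bmp.encode('utf-16-le')) // 2 == 1   # BMP chars stay one unit
```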
Re: python 3.3 repr
On 15/11/2013 16:36, Gene Heskett wrote:
> On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:
>> On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker <ro...@reportlab.com> wrote:
>>>>> ... became popular.
>>>>
>>>> Really? you cried and laughed over 7 vs. 8 bits? That's lovely
>>>> (?). ;). That eighth bit sure was less confusing than codepoint
>>>> translations
>>>
>>> no we had 6 bits in 60 bit words as I recall; extracting the nth
>>> character involved division by 6; smart people did tricks with
>>> inverted multiplications etc etc :(
>>> --
>>> Robin Becker
>>
>> Cool, someone here is older than me! I came in with the 8080, and I
>> remember split octal, but sixes are something I missed out on.
>
> Ok, if you are feeling old decrepit, how's this for a birthday:
> 10/04/34. I came into micro computers about RCA 1802 time. Wrote a
> program for the 1802 without an assembler, for tape editing in '78 at
> KRCR-TV in Redding CA, that was still in use in '94, but never really
> wrote assembly code until the 6809 was out in the Radio Shack Color
> Computers. os9 on the coco's was the best teacher about the unix way
> of doing things there ever was.
>
> So I tell folks these days that I am 39, with 40 years experience at
> being 39. ;-)
>
> Cheers, Gene

I also used the RCA 1802, but did you use the Ferranti F100L? Rationale for the use of both: in the mid/late 70s they were the only processors of their respective type with military approvals. Can't remember how we coded on the F100L, but the 1802 work was done on the Texas Instruments Silent 700, copying from one cassette tape to another. Set the controls wrong when copying and whoops, you've just overwritten the work you've just done. We could have had a decent development environment, but it was on a UK MOD cost-plus project, so the more inefficiently you worked, the more profit your employer made.

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence

-- https://mail.python.org/mailman/listinfo/python-list
Unicode stdin/stdout (was: Re: python 3.3 repr)
Of course, the real solution to this issue is to replace sys.stdout on windows with an object that can handle Unicode directly with the WriteConsoleW function - the problem there is that it will break code that expects to be able to use sys.stdout.buffer for binary I/O. I also wasn't able to get the analogous stdin replacement class to work with input() in my attempts. -- https://mail.python.org/mailman/listinfo/python-list
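A rough sketch of such a replacement (the class name and the non-Windows fallback are mine; it only handles write(), it ignores the sys.stdout.buffer problem mentioned above, and passing len(s) as the code-unit count is only correct for BMP text):

```python
import ctypes
import io
import sys

class ConsoleW(io.TextIOBase):
    """Sketch: send str to a Windows console via WriteConsoleW,
    bypassing the cp437 codec; fall back to the wrapped stream
    everywhere else (files, pipes, non-Windows platforms)."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.use_console = sys.platform == 'win32'
        if self.use_console:
            # STD_OUTPUT_HANDLE == -11
            self.handle = ctypes.windll.kernel32.GetStdHandle(-11)

    def writable(self):
        return True

    def write(self, s):
        if self.use_console:
            written = ctypes.c_ulong(0)
            # WriteConsoleW counts UTF-16 code units; len(s) matches
            # only while s stays within the BMP.
            ctypes.windll.kernel32.WriteConsoleW(
                self.handle, s, len(s), ctypes.byref(written), None)
            return written.value
        return self.wrapped.write(s)

# sys.stdout = ConsoleW(sys.stdout)   # note: callers lose sys.stdout.buffer
```

With this in place, print('\xc1') at a cp437 console would no longer go through the charmap codec at all; redirected output still uses the wrapped stream's encoding.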
Re: python 3.3 repr
On Friday 15 November 2013 13:52:40 Mark Lawrence did opine:
> On 15/11/2013 16:36, Gene Heskett wrote:
>> On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:
>>> On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker <ro...@reportlab.com> wrote:
>>>>>> ... became popular.
>>>>>
>>>>> Really? you cried and laughed over 7 vs. 8 bits? That's lovely
>>>>> (?). ;). That eighth bit sure was less confusing than codepoint
>>>>> translations
>>>>
>>>> no we had 6 bits in 60 bit words as I recall; extracting the nth
>>>> character involved division by 6; smart people did tricks with
>>>> inverted multiplications etc etc :(
>>>> --
>>>> Robin Becker
>>>
>>> Cool, someone here is older than me! I came in with the 8080, and I
>>> remember split octal, but sixes are something I missed out on.
>>
>> Ok, if you are feeling old decrepit, how's this for a birthday:
>> 10/04/34. I came into micro computers about RCA 1802 time. Wrote a
>> program for the 1802 without an assembler, for tape editing in '78 at
>> KRCR-TV in Redding CA, that was still in use in '94, but never really
>> wrote assembly code until the 6809 was out in the Radio Shack Color
>> Computers. os9 on the coco's was the best teacher about the unix way
>> of doing things there ever was.
>>
>> So I tell folks these days that I am 39, with 40 years experience at
>> being 39. ;-)
>>
>> Cheers, Gene
>
> I also used the RCA 1802, but did you use the Ferranti F100L?
> Rationale for the use of both: in the mid/late 70s they were the only
> processors of their respective type with military approvals. Can't
> remember how we coded on the F100L, but the 1802 work was done on the
> Texas Instruments Silent 700, copying from one cassette tape to
> another. Set the controls wrong when copying and whoops, you've just
> overwritten the work you've just done. We could have had a decent
> development environment, but it was on a UK MOD cost-plus project, so
> the more inefficiently you worked, the more profit your employer made.

BTDT but in the 1959-60 era. Testing the ullage pressure regulators for the early birds, including some that gave John Glenn his first ride or 2.

I don't recall the brand of paper tape recorders, but they used 12AT7's and 12AU7's by the grocery sack full. One or more got noisy, and me being the budding C.E.T. that I now am, I of course ran down the bad ones and requested new ones. But you had to turn in the old ones, which Stellardyne Labs simply recycled back to you the next time you needed a few. Hopeless management IMO, but that's cost plus for you.

At $10k a truckload for helium back then, each test lost about $3k worth of helium because the recycle catcher tank was so thin walled. And the 6-stage cardox re-compressor was so leaky, occasionally blowing up a pipe out of the last stage that put about 7800 lbs back in the monel tanks. I considered that a huge waste compared to the cost of a 12AU7, then about $1.35, and raised hell, so I got fired. They simply did not care that a perfectly good regulator was being abused to death when it took 10 or more test runs to get one good recording for the certification. At those operating pressures, the valve faces erode just like the seats in your shower faucets do in 20 years. Ten such runs and you may as well bin it, but they didn't. I am amazed that as many of those birds worked as did. Of course if it wasn't manned, they didn't talk about the roman candles on the launch pads. I heard one story that they had to regrade one pad's real estate at Vandenberg and start all over; it seems some ID10T had left the cable to the explosive bolts hanging on the cable tower. Oops, and there's no off switch in many of those once the umbilical has been dropped.

Cheers, Gene
--
There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author)
"Tehee quod she, and clapte the wyndow to." -- Geoffrey Chaucer
A pen in the hand of this president is far more dangerous than 200 million guns in the hands of law-abiding citizens.
-- https://mail.python.org/mailman/listinfo/python-list
Re: python 3.3 repr
On 11/15/2013 6:28 AM, Robin Becker wrote:
> I'm trying to understand what's going on with this simple program
>
> if __name__=='__main__':
>     print("repr=%s" % repr(u'\xc1'))
>     print("%%r=%r" % u'\xc1')
>
> On my windows XP box this fails miserably if run directly at a terminal
>
> C:\tmp> \Python33\python.exe bang.py
> Traceback (most recent call last):
>   File "bang.py", line 2, in <module>
>     print("repr=%s" % repr(u'\xc1'))
>   File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in
> position 6: character maps to <undefined>
>
> If I run the program redirected into a file then no error occurs and
> the result looks like this
>
> C:\tmp> cat fff
> repr='┴'
> %r='┴'
>
> and if I run it into a pipe it works as though into a file.
>
> It seems that repr thinks it can render u'\xc1' directly which is a
> problem since print then seems to want to convert that to cp437 if
> directed into a terminal. I find the idea that print knows what it's
> printing to a bit dangerous,

print() just calls file.write(s), where file defaults to sys.stdout, for each string fragment it creates. write(s) *has* to encode s to bytes according to some encoding, and it uses the encoding associated with the file when it was opened.

> but it's the repr behaviour that strikes me as bad. What is
> responsible for defining the repr function's 'printable' so that repr
> would give me say an Ascii rendering?

That is not repr's job. Perhaps you are looking for

>>> repr(u'\xc1')
"'Á'"
>>> ascii(u'\xc1')
"'\\xc1'"

The above is with Idle on Win7. It is *much* better than the intentionally crippled console for working with the BMP subset of unicode.

--
Terry Jan Reedy
-- https://mail.python.org/mailman/listinfo/python-list
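The repr()/ascii() distinction, plus the cp437 failure from Robin's traceback, in runnable form:

```python
# repr() keeps printable non-ASCII characters as-is in Python 3;
# ascii() (and the %a format) escape them, like Python 2's repr().
assert repr('\xc1') == "'Á'"
assert ascii('\xc1') == "'\\xc1'"
assert '%a' % '\xc1' == "'\\xc1'"
assert '%r' % '\xc1' == "'Á'"

# The terminal failure in the traceback: cp437 simply has no Á,
# so encoding for a cp437 console raises UnicodeEncodeError.
try:
    '\xc1'.encode('cp437')
except UnicodeEncodeError as e:
    assert 'charmap' in str(e)
else:
    raise AssertionError('expected UnicodeEncodeError')
```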
Re: python 3.3 repr
On Fri, 15 Nov 2013 17:47:01 +0000, Neil Cerutti wrote:
> The unicode support I'm learning in Go is, "Everything is utf-8,
> right? RIGHT?!?" It also has the interesting behavior that indexing
> strings retrieves bytes, while iterating over them results in a
> sequence of runes. It comes with support for no encodings save utf-8
> (natively) and utf-16 (if you work at it). Is that really enough?

Only if you never need to handle data created by other applications.

-- Steven
-- https://mail.python.org/mailman/listinfo/python-list