On Mon, Aug 20, 2012 at 10:35 AM, Terry Reedy <tjre...@udel.edu> wrote:
> On 8/19/2012 6:42 PM, Chris Angelico wrote:
>> However, Python goes a bit further by making it VERY clear that this
>> is a mere optimization, and that Unicode strings and bytes strings are
>> completely different beasts. In Pike, it's possible to forget to
>> encode something before (say) writing it to a socket. Everything works
>> fine while you have only ASCII characters in the string, and then
>> breaks when you have a >255 codepoint - or perhaps worse, when you
>> have a 127<x<256, and the other end misinterprets it.
>
> Python writes strings to file objects, including open sockets, without
> creating a bytes object -- IF the file is opened in text mode, which always
> has an associated encoding, even if the default 'ascii'. From what you say,
> this is what Pike is missing.

In text mode, the library does the encoding, but an encoding still happens.

> I am pretty sure that the obvious optimization has already been done. The
> internal bytes of all-ascii text can safely be sent to a file with ascii (or
> ascii-compatible) encoding without intermediate 'decoding'. I remember
> several patches of that sort. If a string is internally ucs2 and the file is
> declared ucs2 or utf-16 encoding, then again, pairs of bytes can go directly
> (possibly with a byte swap).

Maybe nothing in memory has to change, but there is a data type change. A Unicode string cannot be sent over the network; an encoding is needed.

In Pike, I can take a string like "\x20AC" (or "\u20ac" or "\U000020ac", same thing) and manipulate it as a one-character string, but I cannot write it to a file or file-like object. I can, however, pass it through a codec (and there's string_to_utf8() for the convenience of the common case), and get back something like "\xe2\x82\xac", which is a three-byte string.

The thing is, though, that this new string is of exactly the same data type as the original: 'string'. Which means that I could have a string containing Latin-1 but non-ASCII characters, and Pike will happily write it to a socket without raising a compile-time or run-time error. Python, under the same circumstances, would either raise an error or quietly (and correctly) encode the data.

But this is a relatively trivial point, in the scheme of things. Python has an excellent model now for handling Unicode strings, and I would STRONGLY recommend that everyone upgrade to 3.3.

ChrisA
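
P.S. For anyone who wants to see the Python 3 side of this concretely, here is a minimal sketch using only the standard library. The variable names are mine, and socket.socketpair() and io.BytesIO merely stand in for a real network peer and a real file; those are assumptions of the example, not anything taken from the discussion above.

```python
import io
import socket

euro = "\u20ac"                  # one code point, type str
encoded = euro.encode("utf-8")   # three bytes, type bytes

# A socket accepts bytes only; handing it a str raises TypeError
# instead of silently writing some internal representation.
left, right = socket.socketpair()
send_error = None
try:
    left.send(euro)
except TypeError as exc:
    send_error = exc
left.send(encoded)               # the explicitly encoded form is fine
received = right.recv(16)
left.close()
right.close()

# Text mode: the library does the encoding, but it still happens.
raw = io.BytesIO()
writer = io.TextIOWrapper(raw, encoding="utf-8")
writer.write(euro)
writer.flush()
text_mode_bytes = raw.getvalue()

# The 127<x<256 hazard: the same character yields different bytes
# depending on which encoding each end assumes.
e_acute = "\xe9"
as_latin1 = e_acute.encode("latin-1")
as_utf8 = e_acute.encode("utf-8")
```

The point of the sketch is only that str and bytes are separate types, so forgetting to encode fails loudly at the socket rather than working until the first non-ASCII character shows up.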