On 8/19/2012 6:42 PM, Chris Angelico wrote:
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy <tjre...@udel.edu> wrote:
Python has often copied or borrowed, with adjustments. This time it is the first.
I should have added 'that I know of' ;-)
Maybe it wasn't consciously borrowed, but whatever innovation is done, there's usually an obscure beardless language that did it earlier. :) Pike has a single string type, which can use the full Unicode range. If all codepoints are <256, the string width is 8 (measured in bits); if <65536, width is 16; otherwise 32. Using the inbuilt count_memory function (similar to the Python function used somewhere earlier in this thread, but which I can't at present put my finger on), I find that for strings of 16 bytes or more, there's a fixed 20-byte header plus the string content, stored in the correct number of bytes. (Pike strings, like Python ones, are immutable and do not need expansion room.)
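A rough sketch of the width rule described above, written in Python rather than Pike. The helper names string_width_bits and content_bytes are made up for illustration, and sys.getsizeof is only a guess at the Python function alluded to; header sizes differ between Pike and CPython, so no particular totals are claimed.

```python
import sys


def string_width_bits(s):
    """Return 8, 16, or 32: the narrowest width that fits every code point in s."""
    highest = max(map(ord, s), default=0)  # the empty string counts as width 8
    if highest < 256:
        return 8
    if highest < 65536:
        return 16
    return 32


def content_bytes(s):
    """Bytes needed for the characters alone, excluding any fixed header."""
    return len(s) * string_width_bits(s) // 8


if __name__ == "__main__":
    # One wide character is enough to widen the whole string.
    for sample in ("spam", "sp\u0101m", "sp\U0001F40Dm"):
        print(ascii(sample), string_width_bits(sample),
              content_bytes(sample), sys.getsizeof(sample))
```

The last column is whatever the running interpreter reports; on a CPython with the PEP 393 representation it should grow with the width, but the exact figures are implementation details.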
It is even possible that someone involved was at least vaguely aware that there was an antecedent. The PEP makes no claim that I can see, but lays out the problem and goes right to details of a Python implementation.
However, Python goes a bit further by making it VERY clear that this is a mere optimization, and that Unicode strings and bytes strings are completely different beasts. In Pike, it's possible to forget to encode something before (say) writing it to a socket. Everything works fine while you have only ASCII characters in the string, and then breaks when you have a >255 codepoint - or perhaps worse, when you have a 127<x<256, and the other end misinterprets it.
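To illustrate the point on the Python 3 side, here is a small sketch. io.BytesIO stands in for a socket (an assumption made so the example runs without a network), and send is a hypothetical helper. Handing a str to a binary stream fails immediately, even for text that happens to be all ASCII, and the second half shows the 127<x<256 misreading when the two ends disagree on the encoding.

```python
import io


def send(stream, payload):
    """Hypothetical helper: write payload to a binary stream."""
    stream.write(payload)


sock_like = io.BytesIO()  # stand-in for a socket opened in binary mode

# Forgetting to encode is caught at once, not only when a wide character shows up.
try:
    send(sock_like, "caf\xe9")
    refused = False
except TypeError:
    refused = True

# Encoding explicitly makes the bytes on the wire unambiguous to the sender...
wire = "caf\xe9".encode("utf-8")
send(sock_like, wire)

# ...but a receiver that assumes Latin-1 still misreads the two UTF-8 bytes
# of U+00E9 as two separate characters.
misread = wire.decode("latin-1")
```

Here refused ends up True and misread is a five-character string rather than the original four.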
Python writes strings to file objects, including open sockets, without creating a bytes object -- IF the file is opened in text mode, which always has an associated encoding, even if the default 'ascii'. From what you say, this is what Pike is missing.
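A minimal sketch of the text-mode behaviour, assuming an in-memory io.BytesIO in place of a real file or socket. The io.TextIOWrapper carries the encoding, so the caller writes str and never builds a bytes object itself; with encoding="ascii" a non-ASCII character is rejected rather than silently passed through.

```python
import io

# Text layer over a binary stream; the encoding belongs to the wrapper.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
text.write("na\xefve \u20ac5\n")  # a str goes in
text.flush()
utf8_bytes = raw.getvalue()       # encoded bytes come out underneath

# The same write against an ASCII-only text stream fails loudly.
strict = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    strict.write("na\xefve")
    strict.flush()
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```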
I am pretty sure that the obvious optimization has already been done. The internal bytes of all-ascii text can safely be sent to a file with ascii (or ascii-compatible) encoding without intermediate 'decoding'. I remember several patches of that sort. If a string is internally ucs2 and the file is declared ucs2 or utf-16 encoding, then again, pairs of bytes can go directly (possibly with a byte swap).
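The following sketch only checks the byte-level facts that would make such shortcuts safe; it says nothing about what CPython actually does internally. swap_pairs is a hypothetical helper, and the sample text is restricted to the BMP so that no surrogate pairs are involved.

```python
def swap_pairs(data):
    """Swap each adjacent pair of bytes (data must have even length)."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


# All-ASCII text produces identical bytes under any ASCII-compatible encoding.
ascii_text = "plain ascii"
same = {ascii_text.encode(enc) for enc in ("ascii", "latin-1", "utf-8")}

# BMP-only text is exactly two bytes per character in UTF-16, and the two
# byte orders differ only by swapping each pair.
bmp_text = "\u0394x = \u03a9"
le = bmp_text.encode("utf-16-le")
be = bmp_text.encode("utf-16-be")
```

The set same collapses to a single element, and swap_pairs(le) equals be.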
--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list