On 8/19/2012 6:42 PM, Chris Angelico wrote:
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy <tjre...@udel.edu> wrote:

Python has often copied or borrowed, with adjustments. This time it is the
first.

I should have added 'that I know of' ;-)

Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are <256, the string width is 8 (measured in bits);
if <65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger on), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)
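
For comparison on the Python side, a minimal sketch of the width rule described above. The helper names width_bits and measured_bytes_per_char are made up for illustration. The measurement part assumes CPython 3.3 or later (the flexible string representation, presumably PEP 393) and uses sys.getsizeof; header sizes vary by build, so only the per-character growth is looked at.

```python
import sys

def width_bits(s: str) -> int:
    """Bits per character under the rule above: 8, 16 or 32."""
    top = max(map(ord, s), default=0)  # empty string counts as narrow
    if top < 256:
        return 8
    if top < 65536:
        return 16
    return 32

def measured_bytes_per_char(ch: str, n: int = 64) -> int:
    """CPython-specific: how much the object grows per extra copy of ch."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# One sample code point from each width class.
for ch in ("a", "\u00e9", "\u0100", "\U00010000"):
    print(hex(ord(ch)), width_bits(ch), measured_bytes_per_char(ch))
```

On a CPython build with the flexible representation, the measured growth should line up with width_bits(ch) // 8.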

It is even possible that someone involved was vaguely aware that there was an antecedent. The PEP makes no claim that I can see, but lays out the problem and goes right to the details of a Python implementation.

However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a >255 codepoint - or perhaps worse, when you
have a codepoint in 127<x<256, and the other end misinterprets it.
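
A small sketch of the Python 3 behaviour being contrasted here, with io.BytesIO standing in for a socket (an assumption made only to keep the example self-contained). It shows that a binary stream refuses a str outright, and what the "other end misinterprets it" case looks like for a codepoint between 128 and 255.

```python
import io

text = "caf\u00e9"  # contains U+00E9, i.e. 127 < x < 256

# A binary stream refuses str instead of silently writing it.
raw = io.BytesIO()
try:
    raw.write(text)
    refused = False
except TypeError:
    refused = True

# Encoding explicitly makes the choice visible, and the choice matters.
as_latin1 = text.encode("latin-1")  # one byte for U+00E9
as_utf8 = text.encode("utf-8")      # two bytes for U+00E9

# A receiver that guesses the wrong encoding gets mojibake, not an error.
misread = as_utf8.decode("latin-1")

print(refused, as_latin1, as_utf8, misread)
```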

Python writes strings to file objects, including open sockets, without creating a bytes object -- IF the file is opened in text mode, which always has an associated encoding, even if it is only the default 'ascii'. From what you say, this is what Pike is missing.
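
A minimal sketch of the text-mode path, assuming io.TextIOWrapper over an in-memory io.BytesIO in place of a real file or socket. The helper name write_text is hypothetical. For an actual socket, socket.makefile() with a text mode and an encoding= argument would play the role of the wrapper.

```python
import io

def write_text(text: str, encoding: str) -> bytes:
    """Write text through a text-mode wrapper; return what reached the binary layer."""
    raw = io.BytesIO()
    # newline="" disables newline translation so the bytes are predictable.
    wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="")
    wrapper.write(text)   # the wrapper encodes; the caller never builds bytes
    wrapper.flush()
    return raw.getvalue()

print(write_text("hello", "ascii"))
print(write_text("caf\u00e9", "utf-8"))

# With an 'ascii' text stream, non-ASCII text fails loudly at write time.
try:
    write_text("caf\u00e9", "ascii")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

The point of the design is that the encoding is a property of the stream, so forgetting to encode is not possible in the same way.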

I am pretty sure that the obvious optimization has already been done. The internal bytes of all-ascii text can safely be sent to a file with ascii (or ascii-compatible) encoding without an intermediate 'encoding' step. I remember several patches of that sort. If a string is internally ucs2 and the file is declared ucs2 or utf-16 encoding, then again, pairs of bytes can go directly (possibly with a byte swap).
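
The following does not show CPython's internals; it is only a sketch of why such shortcuts are safe at the byte level. The helper swap_pairs is made up for illustration, and the sample strings are arbitrary.

```python
def swap_pairs(data: bytes) -> bytes:
    """Swap each pair of adjacent bytes; the length must be even."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)

# ASCII-only text encodes to the same bytes under ASCII-compatible codecs,
# so the stored bytes could be passed along unchanged.
ascii_text = "plain text"
same = (ascii_text.encode("ascii")
        == ascii_text.encode("utf-8")
        == ascii_text.encode("latin-1"))

# Text whose code points all fit in 16 bits (and are not surrogates)
# is two bytes per character in UTF-16, in either byte order.
bmp_text = "\u0100\u0416\u4e2d"
le = bmp_text.encode("utf-16-le")
be = bmp_text.encode("utf-16-be")

print(same, len(le) == 2 * len(bmp_text), swap_pairs(le) == be)
```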


--
Terry Jan Reedy
