Re: How do I display unicode value stored in a string variable using ord()

Terry Reedy Sun, 19 Aug 2012 17:39:15 -0700

On 8/19/2012 6:42 PM, Chris Angelico wrote:

On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy <[email protected]> wrote:

Python has often copied or borrowed, with adjustments. This time it is the
first.


I should have added 'that I know of' ;-)

Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are <256, the string width is 8 (measured in bits);
if <65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger on), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)
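
For comparison, the same width-per-string behaviour can be observed from Python with sys.getsizeof. This is only a rough sketch, assuming CPython 3.3 or later (where the flexible string representation applies); the helper name bytes_per_char is made up for illustration, and the exact header size is deliberately not measured because it varies between builds.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for strings made only of `ch`.

    Comparing two lengths cancels out the fixed object header, so only
    the per-character cost remains. (Hypothetical helper, CPython only.)
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = [
    ("all code points < 256", "a"),
    ("all code points < 65536", "\u0100"),
    ("some code point >= 65536", "\U00010000"),
]

for label, ch in samples:
    print(label, "->", bytes_per_char(ch), "byte(s) per character")
```

On such an interpreter the three cases should come out as 1, 2 and 4 bytes per character, matching the 8/16/32-bit widths described for Pike.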

It is even possible that someone involved was vaguely aware that
there was an antecedent. The PEP makes no claim that I can see, but lays
out the problem and goes right to details of a Python implementation.

However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a >255 codepoint - or perhaps worse, when you
have a 127<x<256, and the other end misinterprets it.
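
A small sketch of both failure modes, seen from the Python 3 side. It assumes an io.BytesIO object as a stand-in for the socket (no real network is involved), and the variable names are illustrative only.

```python
import io

text = "caf\xe9"        # contains U+00E9, a code point between 128 and 255

# Python 3 refuses to put an unencoded str on a binary stream at all.
raw = io.BytesIO()      # stand-in for a socket or binary file
error = None
try:
    raw.write(text)     # the "forgot to encode" mistake
except TypeError as exc:
    error = exc
print("write of str rejected:", error)

# If the sender encodes as UTF-8 but the receiver assumes Latin-1,
# nothing fails; the text is silently misread instead.
wire = text.encode("utf-8")
misread = wire.decode("latin-1")
print(repr(misread))
```

The second half is the quieter problem: no exception is raised, and the receiver simply ends up with two wrong characters in place of one.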

Python writes strings to file objects, including open sockets, without
creating a bytes object -- IF the file is opened in text mode, which
always has an associated encoding, even if the default 'ascii'. From
what you say, this is what Pike is missing.
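
A minimal sketch of what text mode does with the associated encoding, using io.TextIOWrapper over an in-memory io.BytesIO rather than a real file or socket. The helper name write_text is hypothetical.

```python
import io

def write_text(text, encoding):
    """Write `text` through a text-mode wrapper and return the bytes
    that reached the underlying binary stream. (Illustrative helper.)"""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding=encoding)
    wrapper.write(text)      # caller passes a str, never a bytes object
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()         # leave `raw` open when the wrapper goes away
    return data

print(write_text("caf\xe9", "utf-8"))
print(write_text("caf\xe9", "latin-1"))

try:
    write_text("caf\xe9", "ascii")
except UnicodeEncodeError as exc:
    print("ascii stream refuses it:", exc.reason)
```

The same str produces different bytes depending on the stream's encoding, and an encoding that cannot represent the text fails loudly at write time instead of sending something ambiguous.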

I am pretty sure that the obvious optimization has already been done.
The internal bytes of all-ascii text can safely be sent to a file with
ascii (or ascii-compatible) encoding without intermediate 'decoding'. I
remember several patches of that sort. If a string is internally ucs2
and the file is declared ucs2 or utf-16 encoding, then again, pairs of
bytes can go directly (possibly with a byte swap).
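
The byte-level facts that make such shortcuts safe can be checked from Python itself. This sketch does not look at CPython's internals or at any particular patch; it only shows that the encoded forms coincide, and swap_pairs is a hypothetical helper.

```python
def swap_pairs(data):
    """Swap the two bytes of every 16-bit unit (illustrative helper)."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)

# All-ASCII text: every ASCII-compatible encoding yields the same bytes.
ascii_text = "plain ascii"
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("utf-8")
    == ascii_text.encode("latin-1")
)
print(same_bytes)

# Text whose code points all fit in 16 bits: the two UTF-16 byte orders
# differ only by a swap within each pair of bytes.
bmp_text = "\u0100\u20ac"
little = bmp_text.encode("utf-16-le")
big = bmp_text.encode("utf-16-be")
print(swap_pairs(little) == big)
```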



--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
