Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
On Sunday, 19 August 2012 10:56:36 UTC+2, Steven D'Aprano wrote:

> > internal implementation, and strings which fit exactly in Latin-1 will
>

And this is the crucial point. Latin-1 is an obsolete and unusable
coding scheme (especially for European languages).

This brings us back to the point I mentioned above. Microsoft knows this,
ditto for Apple, ditto for "TeX", ditto for the foundries. Even "ISO" has
recognized its error and produced ISO-8859-15.

The question: why is it still used?

jmf
--
http://mail.python.org/mailman/listinfo/python-list
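To make the Latin-1 versus ISO-8859-15 remark concrete, here is a minimal sketch using Python's standard codecs. The helper name `encodable` is made up for this illustration and is not part of any library. It only shows which characters each codec can represent; it says nothing about how PEP 393 stores strings in memory.

```python
def encodable(text, codec):
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# The euro sign and the oe ligature are missing from Latin-1 (ISO-8859-1)
# but were added in ISO-8859-15.
for ch in ("\u20ac", "\u0153"):
    print(repr(ch),
          "latin-1:", encodable(ch, "latin-1"),
          "iso-8859-15:", encodable(ch, "iso-8859-15"))
```

Note that "Latin-1" in the PEP 393 discussion refers to the code point range U+0000 to U+00FF, so the choice of on-disk encoding is a separate question from the in-memory layout.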
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:

> Steven D'Aprano wrote:
>> I don't know where people are getting this myth that PEP 393 uses
>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>> that 1-byte formats are only used for ASCII strings.
>
> From
>
> Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC
> 4.6.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> [sys.getsizeof("é"*i) for i in range(10)]
> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]

py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

py> c = chr(0x + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]

On re-reading the PEP more closely, it looks like I did misunderstand the
internal implementation, and strings which fit exactly in Latin-1 will
also use 1 byte per character. There are three structures used:

    PyASCIIObject
    PyCompactUnicodeObject
    PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and
4-byte data. So I stand corrected.

--
Steven
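The width rule the sessions above point to can be sketched in a few lines. This is only an illustration of the rule as PEP 393 describes it (the widest character decides the per-character width for the whole string); the function name `pep393_width` is made up here, and the sketch does not inspect the real object layout or header sizes.

```python
def pep393_width(s):
    """Bytes per character a PEP 393 build is expected to use for `s`.

    Assumption: the width depends only on the highest code point in `s`.
    """
    if not s:
        return 1  # the empty string needs no wider storage
    top = max(map(ord, s))
    if top < 0x100:      # ASCII and the rest of the Latin-1 range
        return 1
    if top < 0x10000:    # the rest of the Basic Multilingual Plane
        return 2
    return 4             # supplementary planes


for sample in ("e", "\xe9", "\u20ac", "\U00010000"):
    print(ascii(sample), pep393_width(sample))
```

A single wide character is enough to raise the width for the entire string, which is why "a" plus "é" stays at 1 byte per character while "a" plus "€" moves to 2.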
New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano wrote:

> On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
>
>> "a" will be stored as 1 byte/codepoint.
>>
>> Adding "é", it will still be stored as 1 byte/codepoint.
>
> Wrong. It will be 2 bytes, just like it already is in Python 3.2.
>
> I don't know where people are getting this myth that PEP 393 uses Latin-1
> internally, it does not. Read the PEP, it explicitly states that 1-byte
> formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51)
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200

I infer that

(1) both ASCII and Latin-1 strings require one byte per character.
(2) Latin-1 strings have a constant overhead of 24 bytes (on a 64-bit
    system) over ASCII-only.
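The two inferences above can be checked with a small script instead of an interactive session. This is a sketch that assumes CPython 3.3 or later; the helper names `bytes_per_char` and `header_gap` are invented for this example. Both helpers compare freshly built strings of different lengths, so the fixed object header cancels out of the per-character figure.

```python
import sys


def bytes_per_char(ch, n=100):
    """Marginal storage cost, in bytes, of one more copy of `ch`."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


def header_gap(n=100):
    """Extra fixed overhead of a non-ASCII Latin-1 string over an ASCII one.

    Both strings hold `n` one-byte characters, so any difference comes
    from the object header. The exact value depends on the Python
    version and on pointer size.
    """
    return sys.getsizeof("\xe9" * n) - sys.getsizeof("e" * n)


samples = [("ASCII", "e"), ("Latin-1", "\xe9"),
           ("BMP", "\u20ac"), ("non-BMP", "\U00010000")]
for label, ch in samples:
    print(label, bytes_per_char(ch))
print("header gap:", header_gap())
```

On a CPython build with PEP 393, the per-character figures should match the 1, 1, 2 and 4 bytes seen in the thread; the header gap is the quantity reported as 24 bytes for the 64-bit 3.3.0a4+ build and may differ on other builds.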