Steven D'Aprano wrote:

> On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:
>
>> Steven D'Aprano wrote:
>
>>> I don't know where people are getting this myth that PEP 393 uses
>>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>>> that 1-byte formats are only used for ASCII strings.
>>
>> From
>>
>> Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC
>> 4.6.1] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import sys
>> >>> [sys.getsizeof("é"*i) for i in range(10)]
>> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>
> Interesting. Say, I don't suppose you're using a 64-bit build? Because
> that would explain why your sizes are so larger than mine:
>
> py> [sys.getsizeof("é"*i) for i in range(10)]
> [25, 38, 39, 40, 41, 42, 43, 44, 45, 46]
>
>
> py> [sys.getsizeof("€"*i) for i in range(10)]
> [25, 40, 42, 44, 46, 48, 50, 52, 54, 56]
Yes, I am using a 64-bit build. I thought that

>> (2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit
>> system) over ASCII-only.

would convey that. The corresponding data structure

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

makes for 12 extra bytes on 32 bit, and both Py_ssize_t and pointers double
in size (from 4 to 8 bytes) on 64 bit. I'm sure you can do the maths for the
embedded PyASCIIObject yourself.
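
For anyone who wants to check that arithmetic on their own build, here is a minimal sketch using ctypes. It mirrors only the three fields that PyCompactUnicodeObject adds after the embedded PyASCIIObject, exactly as in the struct quoted above. The class name CompactExtra is made up for illustration, and the layout is the 3.3-era one from this thread; later CPython releases may use a different struct, so the result need not match sys.getsizeof() there.

```python
import ctypes


class CompactExtra(ctypes.Structure):
    # Hypothetical helper: only the fields PyCompactUnicodeObject adds
    # on top of PyASCIIObject, in the order quoted in the post.
    _fields_ = [
        ("utf8_length", ctypes.c_ssize_t),  # Py_ssize_t
        ("utf8", ctypes.c_char_p),          # char *
        ("wstr_length", ctypes.c_ssize_t),  # Py_ssize_t
    ]


# Extra bytes a non-ASCII compact string carries over an ASCII-only one.
extra = ctypes.sizeof(CompactExtra)
pointer_size = ctypes.sizeof(ctypes.c_void_p)

print("%d-bit build: %d extra bytes" % (pointer_size * 8, extra))
```

On a typical 64-bit build this reports 24 extra bytes, and 12 on a 32-bit one. That agrees with the numbers quoted above: 74 - 49 = 25 on my build and 38 - 25 = 13 on Steven's, i.e. the fixed overhead plus one byte for the single "é".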