In reviewing memory usage, I found potential for saving more memory for ASCII-only strings. Both Victor and Guido commented that something like this should be done; Antoine had asked whether there was anything that could be done. Here is the idea:
In an ASCII-only string, the UTF-8 representation is identical to the canonical one-byte representation, so the two can share storage. This would allow dropping the UTF-8 pointer and the UTF-8 length field; instead, a flag in the state would indicate that these fields are not there. Likewise, the wchar_t/Py_UNICODE length can be shared (even though the data cannot), since an ASCII-only string won't contain any surrogate pairs.

To comply with the C aliasing rules, the structures would look like this:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        union {
            void *any;
            Py_UCS1 *latin1;
            Py_UCS2 *ucs2;
            Py_UCS4 *ucs4;
        } data;
        Py_hash_t hash;
        int state;     /* may include SSTATE_SHORT_ASCII flag */
        wchar_t *wstr;
    } PyASCIIObject;

    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;
        char *utf8;
        Py_ssize_t wstr_length;
    } PyUnicodeObject;

Code that directly accesses the structures would become more complex; code that uses the accessor macros wouldn't notice.

As a result, ASCII-only strings would lose three pointer-sized fields (utf8_length, utf8, and wstr_length) and shrink to their 3.2 structure size. Since they also save space on the individual characters, strings with more than 3 characters (16-bit Py_UNICODE) or more than one character (32-bit Py_UNICODE) would see a total size reduction compared to 3.2.

Objects created through the legacy API (PyUnicode_FromUnicode) that are only later found to be ASCII-only (in PyUnicode_Ready) would still have the UTF-8 pointer shared with the data pointer, but would keep the separate fields for pointer and size.

What do you think?

Regards,
Martin

P.S. There are similar reductions that could be applied to the wstr_length in general: on 32-bit wchar_t systems, it could always be dropped; on 16-bit wchar_t systems, it could be dropped for UCS-2 strings. However, I'm not proposing these, as I think the increase in complexity is not worth the savings.
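For anyone who wants to check the header arithmetic on their own platform, here is a rough sketch that mirrors the two proposed layouts outside of CPython. The Sketch* names, the SKETCH_OBJECT_HEAD macro, and both helper functions are hypothetical stand-ins, not part of any patch; it assumes PyObject_HEAD is a reference count plus a type pointer and that Py_ssize_t and Py_hash_t are pointer-sized (ptrdiff_t here).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: PyObject_HEAD is modelled as a refcount plus a type pointer. */
#define SKETCH_OBJECT_HEAD ptrdiff_t ob_refcnt; void *ob_type;

/* Mirrors the proposed PyASCIIObject; ptrdiff_t stands in for
   Py_ssize_t and Py_hash_t, uintN_t for Py_UCS1/2/4. */
typedef struct {
    SKETCH_OBJECT_HEAD
    ptrdiff_t length;
    union {
        void *any;
        uint8_t *latin1;
        uint16_t *ucs2;
        uint32_t *ucs4;
    } data;
    ptrdiff_t hash;
    int state;
    wchar_t *wstr;
} SketchASCIIObject;

/* Mirrors the proposed PyUnicodeObject: the ASCII header plus the
   three fields that only non-ASCII strings would need. */
typedef struct {
    SketchASCIIObject _base;
    ptrdiff_t utf8_length;
    char *utf8;
    ptrdiff_t wstr_length;
} SketchUnicodeObject;

/* Bytes saved per object header when the short ASCII layout is used. */
size_t ascii_header_saving(void)
{
    return sizeof(SketchUnicodeObject) - sizeof(SketchASCIIObject);
}

/* Prints both header sizes and the difference for the current platform. */
void print_layout_sizes(void)
{
    printf("ASCII-only header: %zu bytes\n", sizeof(SketchASCIIObject));
    printf("full header:       %zu bytes\n", sizeof(SketchUnicodeObject));
    printf("saving:            %zu bytes\n", ascii_header_saving());
}
```

Calling print_layout_sizes() from a small main() shows the per-object saving; it covers only the header, not the character data or any allocator rounding.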