In reviewing memory usage, I found potential for saving more memory for
ASCII-only strings. Both Victor and Guido commented that something like
this be done; Antoine had asked whether there was anything that could
be done. Here is the idea:

In an ASCII-only string, the UTF-8 representation is shared with the
canonical one-byte representation. This would allow to drop the
UTF-8 pointer and the UTF-8 length field; instead, a flag in the state
would indicate that these fields are not there.

Likewise, the wchar_t/Py_UNICODE length can be shared (even though the
data cannot), since the ASCII-only string won't contain any surrogate
pairs.

To comply with the C aliasing rules, the structures would look like this:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
    Py_hash_t hash;
    int state;     /* may include SSTATE_SHORT_ASCII flag */
    wchar_t *wstr;
} PyASCIIObject;


typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyUnicodeObject;

Code that directly accesses the structures would become more
complex; code that use the accessor macros wouldn't notice.

As a result, ASCII-only strings would lose three pointers,
and shrink to their 3.2 structure size. Since they also save
in the individual characters, strings with more than
3 characters (16-bit Py_UNICODE) or more than one character
(32-bit Py_UNICODE) would see a total size reduction compared
to 3.2.

Objects created throught the legacy API (PyUnicode_FromUnicode)
that are only later found to be ASCII-only (in PyUnicode_Ready)
would still have the UTF-8 pointer shared with the data pointer,
but keep including separate fields for pointer & size.

What do you think?

Regards,
Martin

P.S. There are similar reductions that could be applied
to the wstr_length in general: on 32-bit wchar_t systems,
it could be always dropped, on a 16-bit wchar_t system,
it could be dropped for UCS-2 strings. However, I'm not
proposing these, as I think the increase in complexity
is not worth the savings.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to