Victor Stinner, 25.08.2011 00:29:
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD); that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?

For pure ASCII, it might be possible to use a shorter struct:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_hash_t hash;
     int state;
     Py_ssize_t wstr_length;
     wchar_t *wstr;
     /* no more utf8_length, utf8, str */
     /* followed by ascii data */
} _PyASCIIObject;
(-2 pointers, -1 Py_ssize_t: 56 bytes)

=>  "a" is 58 bytes (with utf8 for free, without wchar_t)

For objects allocated with the new API, we can use a shorter struct:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_hash_t hash;
     int state;
     Py_ssize_t wstr_length;
     wchar_t *wstr;
     Py_ssize_t utf8_length;
     char *utf8;
     /* no more str pointer */
     /* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)

=>  "é" is 74 bytes (without utf8 / wchar_t)

For the legacy API:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_hash_t hash;
     int state;
     Py_ssize_t wstr_length;
     wchar_t *wstr;
     Py_ssize_t utf8_length;
     char *utf8;
     void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)

=>  "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)

The current struct:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_UNICODE *str;
     Py_hash_t hash;
     int state;
     PyObject *defenc;
} PyUnicodeObject;

=>  "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is
wchar_t)

... but the code (maybe only the macros?) and debugging will be more complex.

That's an interesting idea. However, it's not required to do this as part of the PEP 393 implementation. This can be added later on if the need evidently arises in general practice.

Also, there is always the possibility to simply intern very short strings in order to avoid their multiplication in memory. Long strings don't suffer from this as the data size quickly dominates. User code that works with a lot of short strings would likely do the same.
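
For example (sketch; PyUnicode_InternInPlace() is existing C-API):

    PyObject *s = PyUnicode_FromString("key");
    if (s == NULL)
        return NULL;
    PyUnicode_InternInPlace(&s);    /* s now refers to the canonical copy */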

BTW, I would expect that many short strings either go away as quickly as they appeared (e.g. in a parser) or come in as literals and are therefore interned anyway. That's just one reason why I suggest waiting for proof of inefficiency in the real world (and, obviously, testing your own code with this as early as possible).


Will the format codes returning a Py_UNICODE pointer with
PyArg_ParseTuple be deprecated?

Because Python 2.x is still dominant and it's already hard enough to port C
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).

Well, it will be quite inefficient in future CPython versions, so I think if it's not officially deprecated at some point, it will deprecate itself for efficiency reasons. Better to make it clear that it's worth investing in better performance here.
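
For comparison, the two argument-parsing styles side by side (sketch; only
the "u" code depends on the Py_UNICODE buffer):

    Py_UNICODE *w;                     /* legacy: pointer into wide buffer */
    if (!PyArg_ParseTuple(args, "u", &w))
        return NULL;

    const char *s;                     /* UTF-8: no wide buffer needed */
    if (!PyArg_ParseTuple(args, "s", &s))
        return NULL;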


Do you think the wstr representation could be removed in some future
version of Python?

Conversion to wchar_t* is common, especially on Windows.

That's an issue. However, I cannot say how common this really is in practice. Surely depends on the specific code, right? How common is it in core CPython?


But I don't know if
we *have to* cache the result. Is it cached by the way? Or is wstr only used
when a string is created from Py_UNICODE?

If it's so common on Windows, maybe it should only be cached there?
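
Converting on demand instead of caching is also possible today, e.g.
(sketch using the existing PyUnicode_AsWideCharString()):

    Py_ssize_t size;
    wchar_t *w = PyUnicode_AsWideCharString(obj, &size);
    if (w == NULL)
        return NULL;
    /* ... pass w to the Windows API ... */
    PyMem_Free(w);                     /* caller owns the buffer */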

Stefan
