Victor Stinner, 25.08.2011 00:29:
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD); that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?

For pure ASCII, it might be possible to use a shorter struct:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_hash_t hash;
     int state;
     Py_ssize_t wstr_length;
     wchar_t *wstr;
     /* no more utf8_length, utf8, str */
     /* followed by ascii data */
} _PyASCIIObject;
(-2 pointers, -1 Py_ssize_t: 56 bytes)

=>  "a" is 58 bytes (with utf8 for free, without wchar_t)

For objects allocated with the new API, we can use a shorter struct:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_hash_t hash;
     int state;
     Py_ssize_t wstr_length;
     wchar_t *wstr;
     Py_ssize_t utf8_length;
     char *utf8;
     /* no more str pointer */
     /* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)

=>  "é" is 74 bytes (without utf8 / wchar_t)

For the legacy API:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_hash_t hash;
     int state;
     Py_ssize_t wstr_length;
     wchar_t *wstr;
     Py_ssize_t utf8_length;
     char *utf8;
     void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)

=>  "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)

The current struct:

typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_UNICODE *str;
     Py_hash_t hash;
     int state;
     PyObject *defenc;
} PyUnicodeObject;

=>  "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is
wchar_t)

... but the code (maybe only the macros?) and debugging will be more complex.

That's an interesting idea. However, it's not required to do this as part of the PEP 393 implementation. This can be added later on if the need evidently arises in general practice.

Also, there is always the possibility to simply intern very short strings in order to avoid their multiplication in memory. Long strings don't suffer from this as the data size quickly dominates. User code that works with a lot of short strings would likely do the same.
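
For example (sketch; PyUnicode_InternInPlace() is existing C-API):

    PyObject *s = PyUnicode_FromString("key");
    if (s == NULL)
        return NULL;
    PyUnicode_InternInPlace(&s);    /* s now refers to the canonical copy */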

BTW, I would expect that many short strings either go away as quickly as they appeared (e.g. in a parser) or come in as literals and are therefore interned anyway. That's just one reason why I suggest waiting for proof of inefficiency in the real world (and, obviously, testing your own code with this as early as possible).


Will the format codes returning a Py_UNICODE pointer with
PyArg_ParseTuple be deprecated?

Because Python 2.x is still dominant and it's already hard enough to port C
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).

Well, it will be quite inefficient in future CPython versions, so I think if it's not officially deprecated at some point, it will deprecate itself for efficiency reasons. Better to make it clear that it's worth investing in better performance here.
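
For comparison, the two argument-parsing styles side by side (sketch; only
the "u" code depends on the Py_UNICODE buffer):

    Py_UNICODE *w;                     /* legacy: pointer into wide buffer */
    if (!PyArg_ParseTuple(args, "u", &w))
        return NULL;

    const char *s;                     /* UTF-8: no wide buffer needed */
    if (!PyArg_ParseTuple(args, "s", &s))
        return NULL;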


Do you think the wstr representation could be removed in some future
version of Python?

Conversion to wchar_t* is common, especially on Windows.

That's an issue. However, I cannot say how common this really is in practice. Surely depends on the specific code, right? How common is it in core CPython?


But I don't know if
we *have to* cache the result. Is it cached by the way? Or is wstr only used
when a string is created from Py_UNICODE?

If it's so common on Windows, maybe it should only be cached there?
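
Converting on demand instead of caching is also possible today, e.g.
(sketch using the existing PyUnicode_AsWideCharString()):

    Py_ssize_t size;
    wchar_t *w = PyUnicode_AsWideCharString(obj, &size);
    if (w == NULL)
        return NULL;
    /* ... pass w to the Windows API ... */
    PyMem_Free(w);                     /* caller owns the buffer */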

Stefan
