Marc-Andre Lemburg added the comment:

On 24.11.2015 02:30, Steven D'Aprano wrote:
>
> Steven D'Aprano added the comment:
>
> On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:
>
>> * the string has a cached UTF-8 byte string (ex: int(s) was called before
>> the resize)
>
> Why do strings cache their UTF-8 encoding?
>
> I presume that some of Python's internals rely on the UTF-8 encoding
> rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393).
> E.g. I infer from the above that int(s) parses the UTF-8 representation
> of s rather than the internal representation. Is that right?
>
> Nevertheless, I wonder why the UTF-8 representation is cached. Is it
> that expensive to generate that it can't be done on the fly, as needed?
> As it stands now, non-ASCII strings may be up to twice as big as they
> need be, once you include the UTF-8 cache. And, as this bug painfully
> shows, the problem with caches is that you run the risk of the cache
> being out of date.

The cache is needed because it's the only way to get a direct C char*
to the object's UTF-8 representation without having to worry about
memory management on the caller's side.

Not having access to this would break a lot of code using the Python
C API, since the cache is there by design. The speedup aspect is
secondary.

Unicode objects are normally immutable, but there are a few corner
cases during the initialization of the objects where they are
in fact mutable for a short while, e.g. when creating an empty
object which is then filled with data and resized to the final
length before passing it back to Python.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25709>
_______________________________________
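
For reference, the caching described in the reply can be observed from Python itself. The following is only a rough sketch, assuming CPython 3.3 or later (where ctypes.pythonapi exposes the C API function PyUnicode_AsUTF8); the variable names are illustrative and not taken from the issue. It asks for the char* twice and checks that the same buffer comes back, and that the string object reports a larger size once the cache has been filled.

```python
import ctypes
import sys

# Assumption: this runs on CPython, where ctypes.pythonapi gives access
# to the interpreter's C API.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_void_p  # raw address of the char*, not a copy

# Build a non-ASCII string at run time so that its UTF-8 cache has not
# been filled yet.
text = chr(0xE9) * 50

size_before = sys.getsizeof(text)
first = as_utf8(text)    # encodes to UTF-8 and stores the result on the object
second = as_utf8(text)   # returns the stored buffer again
size_after = sys.getsizeof(text)

# The caller never frees anything: the buffer is owned by the str object
# and lives as long as the object does.
same_buffer = first == second
cached_bytes = ctypes.string_at(first)

print("same buffer on both calls:", same_buffer)
print("bytes match text.encode('utf-8'):", cached_bytes == text.encode("utf-8"))
print("object grew after caching:", size_after > size_before)
```

The size difference is the memory cost raised in the quoted question: for a non-ASCII string, the cached UTF-8 copy is stored alongside the internal representation.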