> Resizing
> --------
>
> Codecs use resizing a lot. Given that PyCompactUnicodeObject
> does not support resizing, most decoders will have to use
> PyUnicodeObject and thus not benefit from the memory footprint
> advantages of e.g. PyASCIIObject.

Wrong. Even if you create a string using the legacy API (e.g.
PyUnicode_FromUnicode), the string will be quickly compacted to use the
most efficient memory storage (depending on its maximum character).
"Quickly" means at the first call to PyUnicode_READY(); Python tries to
make all strings ready as early as possible.
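To make this concrete, here is a rough sketch (a hypothetical helper,
not code from the PEP or from unicodeobject.c) of how a decoder written
against the legacy API still ends up with the compact storage, assuming
the PEP 393 C API described above:

    #include <Python.h>

    /* Sketch only: a string built through the legacy Py_UNICODE API keeps
       the old wchar_t-based representation until PyUnicode_READY() is
       called; READY then compacts it to the smallest storage (ASCII, UCS1,
       UCS2 or UCS4) for the string's maximum character. */
    static PyObject *
    decode_result(const Py_UNICODE *buf, Py_ssize_t len)
    {
        PyObject *str = PyUnicode_FromUnicode(buf, len); /* legacy storage */
        if (str == NULL)
            return NULL;
        if (PyUnicode_READY(str) < 0) {  /* compact to the final storage */
            Py_DECREF(str);
            return NULL;
        }
        return str;
    }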
> PyASCIIObject has a wchar_t *wstr pointer - I guess this should
> be a char *str pointer, otherwise, where's the memory footprint
> advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

For pure ASCII strings, you don't have to store a pointer to the UTF-8
string, nor the length of the UTF-8 string (in bytes), nor the length of
the wchar_t string (in wide characters): the length is always the length
of the "ASCII" string, and the UTF-8 string is shared with the ASCII
string. The structure is much smaller thanks to these optimizations, and
so Python 3.3 uses less memory than 2.7 for ASCII strings, even for
short strings.

> I also don't see a reason to limit the UCS1 storage version
> to ASCII. Accordingly, the object should be called PyLatin1Object
> or PyUCS1Object.

Latin1 is less interesting: you cannot share the length/data fields with
utf8 or wstr. We didn't add a special case for Latin1 strings (except
using Py_UCS1* buffers to store their characters).

> Furthermore, determining len(obj) will require a loop over
> the data, checking for surrogate code points. A simple memcpy()
> is no longer enough.

Wrong. len(obj) gives the "right" result (see the long discussion about
what the length of a string is in a previous thread...) in O(1), since
it is computed when the string is created.

> ... in practice you only
> very rarely see any non-BMP code points in your data. Making
> all Python users pay for the needs of a tiny fraction is
> not really fair. Remember: practicality beats purity.

The creation of the string is maybe a little bit slower (especially when
you have to scan the string twice to first get the maximum character),
but I think that this slowdown is smaller than the speedup allowed by
the PEP.

Because ASCII strings are now char*, I think that processing ASCII
strings is faster because the CPU can cache more data (close to the
CPU).

We can do better optimizations on ASCII and Latin1 strings (it's faster
to manipulate char* than uint16_t* or uint32_t*). For example,
str.center(), str.ljust(), str.rjust() and str.zfill() now use the very
fast memset() function to pad Latin1 strings. Another example:
duplicating a string (or creating a substring) should be faster just
because you have less data to copy (e.g. 10 bytes for a string of 10
Latin1 characters vs 20 or 40 bytes with Python 3.2).

The two most common encodings in the world are ASCII and UTF-8. With the
PEP 393, encoding to ASCII or UTF-8 is free: you don't have to encode
anything, you directly have the encoded char* buffer (whereas you have
to convert 16/32-bit wchar_t to char* in Python 3.2, even for pure
ASCII). (It's also free to encode a "Latin1" Unicode string to Latin1.)

With the PEP 393, we never have to decode UTF-16 anymore when iterating
on code points to support non-BMP characters correctly (which was
required before in narrow builds, e.g. on Windows). Iterating on code
points is just a simple loop; there is no need to check whether each
character is in the range U+D800-U+DFFF.

There are other funny tricks (optimizations). For example,
text.replace(a, b) knows that there is nothing to do if
maxchar(a) > maxchar(text), where maxchar(obj) only requires reading an
attribute of the string. Think about ASCII and non-ASCII strings:
pure_ascii.replace('\xe9', '') now just creates a new reference... I
don't think that Martin wrote his PEP to be able to implement all these
optimizations, but they are an interesting side effect of his PEP :-)
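To illustrate the maxchar trick, here is a minimal sketch (a
hypothetical helper, not the actual unicodeobject.c code), assuming the
PyUnicode_MAX_CHAR_VALUE() macro from the PEP 393 API and that both
strings are ready:

    #include <Python.h>

    /* Sketch: if the maximum character of "old" is larger than the maximum
       character of "text", then "old" cannot occur in "text", so replace()
       has nothing to do and can simply return a new reference to the
       original string. Both maximum characters are read in O(1). */
    static PyObject *
    replace_shortcut(PyObject *text, PyObject *old)
    {
        if (PyUnicode_MAX_CHAR_VALUE(old) > PyUnicode_MAX_CHAR_VALUE(text)) {
            Py_INCREF(text);
            return text;
        }
        return NULL;   /* caller falls back to the generic replace() code */
    }

For example, with a pure ASCII text and old = '\xe9' (maxchar 0xE9 >
0x7F), the shortcut applies and the text is never scanned at all.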
> The table only lists string sizes up 8 code points. The memory
> savings for these are really only significant for ASCII
> strings on 64-bit platforms, if you use the default UCS2
> Python build as basis.

In the 32 different cases, the PEP 393 is better in 29 cases and "just"
as good as Python 3.2 in 3 corner cases:

 - 1 ASCII, 16-bit wchar, 32-bit
 - 1 Latin1, 32-bit wchar, 32-bit
 - 2 Latin1, 32-bit wchar, 32-bit

Do you really care about these corner cases? See the more realistic
benchmark in Martin's previous email ("PEP 393 memory savings update"):
the PEP 393 not only uses 3x less memory than 3.2, it also uses *less*
memory than Python 2.7, whereas Python 3 uses Unicode for everything!

> For larger strings, I expect the savings to be more significant.

Sure.

> OTOH, a single non-BMP code point in such a string would cause
> the savings to drop significantly again.

In this case, it's just as good as Python 3.2 in wide mode, but worse
than 3.2 in narrow mode. But is it a real use case?

If you want really efficient storage for heterogeneous strings (mixing
ASCII, Latin1, BMP and non-BMP), you can split the text into chunks.
For example, I hope that a text processor like LibreOffice doesn't store
all paragraphs in the same string, but creates at least one string per
paragraph. If you use short chunks, you will not notice the difference
in memory footprint when you insert a non-BMP character. The trick
doesn't work on Python < 3.3.

> For best performance, each algorithm will have to be implemented
> for all three storage types. ...

Good performance can be achieved using the PyUnicode macros like
PyUnicode_READ and PyUnicode_WRITE. But yes, if you want a super-fast
Unicode processor, you can special-case some kinds (UCS1, UCS2, UCS4),
like in the examples I described before (use memset() for Latin1).

> ... Not doing so, will result in a slow-down, if I read the PEP
> correctly.

I don't think so. Browse the new unicodeobject.c: there are few
switch/case statements on the kind (if you ignore the low-level
functions like _PyUnicode_Ready). For example, unicode_isalpha() has
only one implementation, using PyUnicode_READ (see the sketch further
below). PyUnicode_READ doesn't use a switch but classic (fast) pointer
arithmetic.

> It's difficult to say, of what scale, since that
> information is not given in the PEP, but the added loop over
> the complete data array in order to determine the maximum
> code point value suggests that it is significant.

Feel free to run Antoine's benchmarks like stringbench and iobench
yourself; they do micro-benchmarks. But you have to know that very few
codecs use the new Unicode API (I think that only the UTF-8 encoder and
decoder use the new API, maybe also the ASCII codec).

> I am not convinced that the memory savings are big enough
> to warrant the performance penalty and added complexity
> suggested by the PEP.

I didn't run any benchmark, but I don't think that the PEP 393 makes
Python slower. I expect a minor speedup in some corner cases :-) I
prefer to wait until all modules are converted to the new API before
running benchmarks. TODO: unicodedata, _csv, all codecs (especially
error handlers), ...
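For the curious, this is roughly what such a kind-agnostic loop looks
like (a sketch in the spirit of unicode_isalpha(), not a copy of it;
all_alpha() is just an illustrative name):

    #include <Python.h>

    /* Sketch: the same loop handles UCS1, UCS2 and UCS4 storage, because
       PyUnicode_READ() selects the element size from the kind. Returns 1
       if all code points are alphabetic, 0 otherwise, -1 on error. */
    static int
    all_alpha(PyObject *str)
    {
        int kind;
        void *data;
        Py_ssize_t i, len;

        if (PyUnicode_READY(str) < 0)
            return -1;
        kind = PyUnicode_KIND(str);
        data = PyUnicode_DATA(str);
        len = PyUnicode_GET_LENGTH(str);
        for (i = 0; i < len; i++) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            if (!Py_UNICODE_ISALPHA(ch))
                return 0;
        }
        return 1;
    }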
> In practice, using a UCS2 build of Python usually is a good
> compromise between memory savings, performance and standards
> compatibility

About "standards compatibility": the work to support non-BMP characters
everywhere was not finished in Python 3.2, 11 years after the
introduction of Unicode in Python (2.0). Using the new API, non-BMP
characters will be supported for free, everywhere (especially in
*Python*: "\U0010FFFF"[0] and len("\U0010FFFF") no longer give
surprising results).

With the addition of emoticons in a non-BMP range in Unicode 6, non-BMP
characters will become more and more common. Who doesn't like
emoticons? :-) o;-) >< (No, I will not add non-BMP characters to this
email, I don't want to crash your SMTP server and mail client.)

> IMHO, Python should be optimized for UCS2 usage

With the PEP 393, it's better: Python is optimized for any usage! (But I
expect it to be faster in the Latin1 range, U+0000-U+00FF.)

> I do see the advantage for large strings, though.

A friend reads Martin's last benchmark differently: Python 3.2 uses 3x
more memory than Python 2! Can I say that the PEP 393 fixed a huge
regression of Python 3?

> Given that I've been working on and maintaining the Python Unicode
> implementation actively or by providing assistance for almost
> 12 years now, I've also thought about whether it's still worth
> the effort.

Thanks for your huge work on Unicode, Marc-Andre!

> My interests have shifted somewhat into other directions and
> I feel that helping Python reach world domination in other ways
> makes me happier than fighting over Unicode standards, implementations,
> special cases that aren't special enough, and all those other
> nitty-gritty details that cause long discussions :-)

Someone said that we still need to define what a character is! By the
way, what is a code point?

> So I feel that the PEP 393 change is a good time to draw a line
> and leave Unicode maintenance to Ezio, Victor, Martin, and
> all the others that have helped over the years. I know it's
> in good hands.

I don't understand why you would like to stop contributing to Unicode,
but well, as you want. We will try to continue your work.

Victor