> Codecs use resizing a lot. Given that PyCompactUnicodeObject
> does not support resizing, most decoders will have to use
> PyUnicodeObject and thus not benefit from the memory footprint
> advantages of e.g. PyASCIIObject.
No, codecs have been rewritten to not use resizing.

> PyASCIIObject has a wchar_t *wstr pointer - I guess this should
> be a char *str pointer, otherwise, where's the memory footprint
> advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

That's the Py_UNICODE representation, kept for backwards
compatibility. It's normally NULL.

> I also don't see a reason to limit the UCS1 storage version
> to ASCII. Accordingly, the object should be called PyLatin1Object
> or PyUCS1Object.

No. In the ASCII case, the UTF-8 length can be shared with the
regular string length - not so for Latin-1 characters above 127,
which take two bytes in UTF-8.

> Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
> code will cause problems on some systems where wchar_t is a
> signed type.
>
> Python assumes that Py_UNICODE is unsigned and thus doesn't
> check for negative values or take these into account when
> doing range checks or code point arithmetic.
>
> On such platforms, where wchar_t is signed, it is safer to
> typedef Py_UNICODE to an unsigned type of the same width.

No. Py_UNICODE values *must* be in the range 0..17*2**16-1.
Values of 17*2**16 or above are just as bad as negative values,
so having Py_UNICODE unsigned doesn't improve anything.

> Py_UNICODE access to the objects assumes that len(obj) ==
> length of the Py_UNICODE buffer. The PEP suggests that length
> should not take surrogates into account on UCS2 platforms
> such as Windows. This causes len(obj) to not match len(wstr).

Correct.

> As a result, Py_UNICODE access to the Unicode objects breaks
> when surrogate code points are present in the Unicode object
> on UCS2 platforms.

Incorrect. What specifically do you think would break?

> The PEP also does not explain how lone surrogates will be
> handled with respect to the length information.

Just as any other code point. Python does not special-case
surrogate code points anymore.

> Furthermore, determining len(obj) will require a loop over
> the data, checking for surrogate code points.
> A simple memcpy() is no longer enough.

No, it won't. The length of the Unicode object is stored in the
length field.

> I suggest dropping the idea of having len(obj) not count
> wstr surrogate code points, to maintain backwards compatibility
> and allow for working with lone surrogates.

Backwards compatibility is fully preserved: PyUnicode_GET_SIZE
returns the size of the Py_UNICODE buffer, while
PyUnicode_GET_LENGTH returns the true length of the Unicode
object.

> Note that the whole surrogate debate does not have much to
> do with this PEP, since it's mainly about memory footprint
> savings. I'd also urge to do a reality check with respect
> to surrogates and non-BMP code points: in practice you only
> very rarely see any non-BMP code points in your data. Making
> all Python users pay for the needs of a tiny fraction is
> not really fair. Remember: practicality beats purity.

That's the whole point of the PEP: you only pay for what you
actually need, and in most cases that's ASCII.

> For best performance, each algorithm will have to be implemented
> for all three storage types. This will be a trade-off.

I think most developers will be happy with a single version
covering all three cases, especially as it's much more
maintainable.

Kind regards,
Martin
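The footprint and length points argued above can be checked from Python itself once PEP 393 is in place (it landed in CPython 3.3). This is a minimal sketch, not part of the original thread; exact sys.getsizeof() figures vary by interpreter version and build, so only relative comparisons are made:

```python
import sys

# CPython 3.3+ picks the narrowest storage per string:
ascii_s = "a" * 100            # 1 byte/code point (PyASCIIObject)
latin1_s = "\xe9" * 100        # 1 byte/code point, but 2 bytes each in UTF-8
astral_s = "\U0001F600" * 100  # non-BMP: 4 bytes/code point

# You only pay for what you need: wider characters cost more memory.
assert sys.getsizeof(astral_s) > sys.getsizeof(latin1_s) >= sys.getsizeof(ascii_s)

# The ASCII/UTF-8 length sharing: pure-ASCII text is byte-identical
# in UTF-8, so the lengths coincide - not so for Latin-1 above 127.
assert len(ascii_s.encode("utf-8")) == len(ascii_s)
assert len(latin1_s.encode("utf-8")) == 2 * len(latin1_s)

# len() is the true code point count, with no surrogate special-casing:
# a non-BMP character counts as one, and so does a lone surrogate.
assert len("\U0001F600") == 1
assert len("\ud800") == 1
```

Note that the first assertion only illustrates the relative ordering; the per-object overhead of the compact representations is what the structure layout debate above is about.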