On 25.01.2011 12:08, Nick Coghlan wrote:
> On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
>> representation. It is thus identical to the existing
>> _PyUnicode_AsString, which is removed. The function will compute the
>> utf8 representation when first called. Since this representation will
>> consume memory until the string object is released, applications
>> should use the existing PyUnicode_AsUTF8String where possible
>> (which generates a new string object every time). API that implicitly
>> converts a string to a char* (such as the ParseTuple functions) will
>> use this function to compute a conversion.
>
> I'm not entirely clear as to what "this function" is referring to here.

PyUnicode_AsUTF8 (i.e. the one where you don't need to release the
memory). I made this explicit now.

> I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready"
> might be a better option (PyType_Ready seems a better analogy for a
> "I've filled everything in, please calculate the derived fields now"
> than Py_Finalize).

Ok, changed (when I was pondering this PEP, this also occurred to me
once, but I forgot it when I typed it in).

> More generally, let me see if I understand the proposed structure correctly:
>
> str: Always set once PyUnicode_Ready() has been called.
>   Always points to the canonical representation of the string (as
>   indicated by PyUnicode_Kind)
> length: Always set once PyUnicode_Ready() has been called. Specifies
>   the number of code points in the string.

Correct.

> wstr: Set only if PyUnicode_AsUnicode has been called on the string.

Might also be set when the string was created through
PyUnicode_FromUnicode and PyUnicode_Ready hasn't been called.

>   If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
>   or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr
>   = str, otherwise wstr points to dedicated memory
> wstr_length: Valid only if wstr != NULL
>   If wstr_length != length, indicates presence of surrogate pairs in
>   a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
>   PyUnicode_4BYTE).

Correct.

> utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
>   If string contents are pure ASCII, utf8 = str, otherwise utf8
>   points to dedicated memory.
> utf8_length: Valid only if utf8_ptr != NULL

Correct.

> One change I would propose is that rather than hiding flags in the low
> order bits of the str pointer, we expand the use of the existing
> "state" field to cover the representation information in addition to
> the interning information.

Thanks for the idea; done.

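For readers following the field list above, a rough C sketch of the layout and the sharing rules may help. Everything prefixed with sketch_ is a hypothetical name chosen for illustration; this is not the PEP's actual declaration, and the kind member merely stands in for the representation bits that the thread moves into the existing "state" field.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical kind codes; the thread itself only names
   PyUnicode_2BYTE and PyUnicode_4BYTE. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Illustrative layout only (assumption, not the real object header). */
typedef struct {
    void            *str;         /* canonical representation, set once ready */
    ptrdiff_t        length;      /* number of code points */
    wchar_t         *wstr;        /* NULL until a wchar_t form exists */
    ptrdiff_t        wstr_length; /* valid only if wstr != NULL */
    char            *utf8;        /* NULL until a UTF-8 form exists */
    ptrdiff_t        utf8_length; /* valid only if utf8 != NULL */
    enum sketch_kind kind;        /* stand-in for the bits kept in "state" */
} sketch_unicode;

/* wstr may alias str when wchar_t is as wide as the canonical units. */
static int sketch_wstr_can_share(size_t wchar_size, enum sketch_kind kind)
{
    return (wchar_size == 2 && kind == SKETCH_2BYTE)
        || (wchar_size == 4 && kind == SKETCH_4BYTE);
}

/* utf8 may alias str only for pure-ASCII content (max code point < 128). */
static int sketch_utf8_can_share(enum sketch_kind kind, unsigned max_char)
{
    return kind == SKETCH_1BYTE && max_char < 128;
}

/* A wchar_t form longer than the code point count implies surrogate pairs. */
static int sketch_has_surrogates(const sketch_unicode *u)
{
    assert(u->wstr != NULL); /* wstr_length is meaningless otherwise */
    return u->wstr_length != u->length;
}
```

The two predicates mirror the "wstr = str" and "utf8 = str" conditions quoted above; a real implementation would also have to decide who owns and frees the dedicated buffers, which this sketch leaves out.
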
> I would also suggest explicitly flagging
> internally whether or not a 1 byte string is ASCII or Latin-1 along
> the lines of:

Not sure about that. It would complicate PyUnicode_Kind. Instead, I'd
rather fill out utf8 right away if we can use sharing (e.g. when the
string is created with a max value <128, or PyUnicode_Ready has
determined that). So I keep it for the moment as reserved (but would
use it when str is NULL, as I'd have to fill in some value, anyway).

Regards,
Martin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com