On 28.06.2020 16:24, Inada Naoki wrote:
> Hi, Lamburg.
>
> Thank you for quick response.
>
>>
>> We can't just remove access to one half of a codec (the decoding
>> part) without at least providing an alternative for C extensions
>> to use.
>>
>> Py_UNICODE can be removed from the API, but only if there are
>> alternative APIs which C extensions can use to the same effect.
>>
>> Given PEP 393, this would be APIs which use wchar_t instead of
>> Py_UNICODE.
>>
>
> Decoding part is implemented as `const char *` -> `PyObject*` (Unicode
> object).
> I think this is reasonable since `const char *` is perfect to abstract
> the encoded string,
>
> In case of encoding part, `wchar_t *` is not perfect abstraction for
> (decoded) unicode
> string.

Note that the PyUnicode_Encode*() APIs are meant to make the codec's
encoding machinery available to C extensions, so that they don't
have to implement this again. In that sense, their purpose is not
to encode an existing Unicode object, but instead to provide access
to the low-level buffer-to-bytes-object encoding.

The reasoning here is the same as for decoding: you have the original
data you want to process available in some array and want to turn
this into a Python object.

The path Victor suggested requires always going via a Python Unicode
object, but that is very expensive and not really an appropriate way
to address the use case.

As an example application, think of a database module which provides
the Unicode data as a Py_UNICODE buffer. You want to write this as
UTF-8 data to a file or a socket, so you have the PyUnicode_EncodeUTF8()
API encode this for you into a bytes object which you can then write
out using the Python C APIs for this.

> Converting from Unicode object into `wchar_t *` is not zero-cost.
> I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object)
> looks better signature than
> `wchar_t *` -> `Pyobject *` (bytes object) because for encoders.

See above. The motivation for these APIs is different. They are not
about taking a Unicode object and converting it into bytes. They are
deliberately about taking a data buffer as input and producing a
Python bytes object as output (to implement symmetry between decoding
and encoding).

> * Unicode object is more important than `wchar_t *` in Python.

Right, but as I tried to explain in my reply to Victor, I designed
the Unicode API in Python to be a rich API, which provides all
necessary tools to easily work with Unicode in C extensions as well
as in the CPython interpreter.

The API is not only focused on what the CPython interpreter needs.
It's an API which implements a concise interface to Unicode as used
in Python.

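To make the detour through a temporary Unicode object concrete, here is a
minimal sketch of the two-step path for the database example above. It
drives the C API through ctypes only so that it runs without a build
step; a C extension would make the same two calls directly. The names
encode_wide_buffer_utf8 and db_buffer are made up for illustration, and
the buffer contents are just sample data.

```python
import ctypes

# Both functions are part of the public CPython C API.
_from_wide = ctypes.pythonapi.PyUnicode_FromWideChar
_from_wide.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
_from_wide.restype = ctypes.py_object

_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8String
_as_utf8.argtypes = [ctypes.py_object]
_as_utf8.restype = ctypes.py_object


def encode_wide_buffer_utf8(buf, length):
    """Hypothetical helper: wchar_t buffer -> bytes via a temporary str."""
    # Step 1: copy the buffer into a new Unicode object
    # (this is the extra allocation and scan discussed above).
    temp = _from_wide(buf, length)
    # Step 2: encode that temporary object into a bytes object.
    return _as_utf8(temp)


# Stand-in for data a database driver might hand back as a wide-char buffer.
db_buffer = ctypes.create_unicode_buffer("Gr\u00fc\u00dfe")
encoded = encode_wide_buffer_utf8(db_buffer, len(db_buffer.value))
```

A buffer-based encoder would go from db_buffer to the bytes object in a
single call, without creating the intermediate object in step 1.
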
> * All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.
>
> For example, we have these private encode APIs:
>
> * PyObject* _PyUnicode_AsAsciiString(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char
> *errors, int byteorder)
> ...
>
> So how about making them public, instead of undeprecate Py_UNICODE* encode
> APIs?

I'd be fine with keeping just a generic PyUnicode_Encode() API, but
this should then be encoding from a buffer to a bytes object. The
above all take Unicode objects as input and create the same problem
as I described above, with the temporary Unicode object being created
and all the associated malloc and scanning overhead needed for this.

The reason I mention wchar_t as the new basis for the PyUnicode_Encode()
API is that wchar_t has grown to be accepted as the standard for
Unicode buffers in C. If you don't believe that this is good enough,
we could also force Py_UCS4, but this would alienate Windows extension
writers.

> 1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10.
> Current private APIs can become macro (e.g. #define
> _PyUnicode_AsAsciiString PyUnicode_AsAsciiBytes),
> or deprecated static inline function.
> 2. Remove Py_UNICODE* encode APIs in Python 3.12.

FWIW: I don't object to deprecating Py_UNICODE. I just don't want to
lose the symmetry in decoding/encoding and add the cost of having to
go via a Python Unicode object just to encode to bytes.

Thanks,
--
Marc-Andre Lemburg
eGenix.com