[Python-Dev] Re: Plan to remove Py_UNICODE APIs except PEP 623.

M.-A. Lemburg Wed, 01 Jul 2020 13:30:18 -0700

On 28.06.2020 16:24, Inada Naoki wrote:
> Hi, Lemburg.
> 
> Thank you for the quick response.
> 
>>
>> We can't just remove access to one half of a codec (the decoding
>> part) without at least providing an alternative for C extensions
>> to use.
>>
>> Py_UNICODE can be removed from the API, but only if there are
>> alternative APIs which C extensions can use to the same effect.
>>
>> Given PEP 393, this would be APIs which use wchar_t instead of
>> Py_UNICODE.
>>
> 
> The decoding part is implemented as `const char *` -> `PyObject *`
> (Unicode object).
> I think this is reasonable, since `const char *` is a perfect abstraction
> for the encoded string.
> 
> For the encoding part, `wchar_t *` is not a perfect abstraction for a
> (decoded) Unicode string.


Note that the PyUnicode_Encode*() APIs are meant to make the
codec's encoding machinery available to C extensions, so that they
don't have to implement this again.

In that sense, their purpose is not to encode an existing Unicode
object, but instead to provide access to the low-level encoding
from a buffer to a bytes object.

The reasoning here is the same as for decoding: you have the original
data you want to process available in some array and want to turn
this into the Python object.

The path Victor suggested requires always going via a Python Unicode
object, but that is very expensive and not really an appropriate
way to address the use case.

As an example application, think of a database module which provides
the Unicode data as a Py_UNICODE buffer. You want to write this as UTF-8
data to a file or a socket, so you have the PyUnicode_EncodeUTF8() API
encode this for you into a bytes object, which you can then write out
using the Python C APIs for this.
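
To make the buffer-to-bytes idea concrete, here is a minimal sketch in
plain C. It is not the CPython implementation and uses no CPython APIs:
the helper name encode_utf8, the ENCODE_ERROR value and the use of
uint32_t code points as the input buffer are all assumptions made for
illustration. It only shows that a code point buffer can be encoded
straight into UTF-8 bytes without an intermediate string object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error value for this sketch (not a CPython constant). */
#define ENCODE_ERROR ((size_t)-1)

/* Hypothetical helper: encode `len` code points from `src` as UTF-8
   into `dst`, which has room for `cap` bytes. Returns the number of
   bytes written, or ENCODE_ERROR for an invalid code point or when
   the output buffer is too small. */
size_t encode_utf8(const uint32_t *src, size_t len,
                   unsigned char *dst, size_t cap)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        uint32_t c = src[i];
        if (c < 0x80) {                       /* 1 byte: ASCII */
            if (cap - n < 1) return ENCODE_ERROR;
            dst[n++] = (unsigned char)c;
        } else if (c < 0x800) {               /* 2 bytes */
            if (cap - n < 2) return ENCODE_ERROR;
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             /* 3 bytes */
            /* Lone surrogates are not valid in UTF-8. */
            if (c >= 0xD800 && c <= 0xDFFF) return ENCODE_ERROR;
            if (cap - n < 3) return ENCODE_ERROR;
            dst[n++] = (unsigned char)(0xE0 | (c >> 12));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c <= 0x10FFFF) {           /* 4 bytes */
            if (cap - n < 4) return ENCODE_ERROR;
            dst[n++] = (unsigned char)(0xF0 | (c >> 18));
            dst[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {                              /* beyond Unicode range */
            return ENCODE_ERROR;
        }
    }
    return n;
}
```

A database module in the situation described above could call such a
routine on its own buffer and hand the resulting bytes to whatever
writes to the file or socket.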

> Converting from a Unicode object into `wchar_t *` is not zero-cost.
> I think `PyObject *` (Unicode object) -> `PyObject *` (bytes object)
> is a better signature for encoders than
> `wchar_t *` -> `PyObject *` (bytes object), because:

See above. The motivation for these APIs is different. They are
not about taking a Unicode object and converting it into bytes,
they are deliberately about taking a data buffer as input and
producing the Python bytes object as output (to implement symmetry
between decoding and encoding).

> * Unicode object is more important than `wchar_t *` in Python.

Right, but as I tried to explain in my reply to Victor, I designed
the Unicode API in Python to be a rich API, which provides all
necessary tools to easily work with Unicode in C extensions as
well as in the CPython interpreter.

The API is not only focused on what the CPython interpreter needs.
It's an API which implements a concise interface to Unicode as
used in Python.

> * All PyUnicode_EncodeXXX APIs are implemented with PyUnicode_FromWideChar.
> 
> For example, we have these private encode APIs:
> 
> * PyObject* _PyUnicode_AsAsciiString(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_AsLatin1String(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_AsUTF8String(PyObject *unicode, const char *errors)
> * PyObject* _PyUnicode_EncodeUTF16(PyObject *unicode, const char
> *errors, int byteorder)
> ...
> 
> So how about making them public, instead of undeprecating the
> Py_UNICODE* encode APIs?

I'd be fine with keeping just a generic PyUnicode_Encode() API,
but this should then be encoding from a buffer to a bytes object.

The above all take Unicode objects as input and create the same
problem as I described above, with the temporary Unicode object being
created and all the associated malloc and scanning overhead needed
for this.

The reason I mention wchar_t as the new basis for the PyUnicode_Encode()
API is that wchar_t has grown to be accepted as the standard for
Unicode buffers in C. If you don't believe that this is good enough,
we could also force Py_UCS4, but this would alienate Windows extension
writers.
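
As a rough illustration of the wchar_t versus Py_UCS4 trade-off: wchar_t
is 16 bits wide on Windows (holding UTF-16 code units) and typically
32 bits wide on Linux, so code that reads code points from a wchar_t
buffer has to account for surrogate pairs on the former. The sketch
below is plain C with a hypothetical helper name, next_code_point; it
is not part of any CPython API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: read one code point from `buf` (of `len`
   elements) at index *pos and advance *pos past it. When wchar_t is
   16 bits wide, a valid surrogate pair is combined into a single
   code point; an unpaired surrogate is returned unchanged. */
uint32_t next_code_point(const wchar_t *buf, size_t len, size_t *pos)
{
    assert(buf != NULL && pos != NULL && *pos < len);
    uint32_t c = (uint32_t)buf[(*pos)++];
#if WCHAR_MAX <= 0xFFFF
    /* 16-bit wchar_t: look for a high surrogate followed by a low one. */
    if (c >= 0xD800 && c <= 0xDBFF && *pos < len) {
        uint32_t lo = (uint32_t)buf[*pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*pos)++;
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
#endif
    return c;
}
```

With a Py_UCS4 buffer this extra step disappears, but Windows code
that already holds wchar_t data would have to widen it first.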

> 1. Add PyUnicode_AsXXXBytes public APIs in Python 3.10.
>    The current private APIs can become macros
>    (e.g. #define _PyUnicode_AsAsciiString PyUnicode_AsAsciiBytes)
>    or deprecated static inline functions.
> 2. Remove the Py_UNICODE* encode APIs in Python 3.12.

FWIW: I don't object to deprecating Py_UNICODE. I just don't
want to lose the symmetry in decoding/encoding and add the cost
of having to go via a Python Unicode object just to encode
to bytes.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com