Hi,

I propose adding a new C API to "build a Unicode string". What do you
think? Would it be efficient with any possible Unicode string storage
and any Python implementation?

PyPy has a UnicodeBuilder type in Python, but here I only propose a C
API. Later, if needed, it would be easy to add a Python API for it.
PyPy has UnicodeBuilder to replace the "str += str" pattern, which is
inefficient in PyPy. CPython has a micro-optimization (in ceval.c) to
keep the performance of this pattern acceptable. Adding a Python API was
discussed in 2020, see the LWN article:
https://lwn.net/Articles/816415/

Example without error handling: a naive implementation which doesn't use
the known length of the key and value strings (calling
PyUnicodeBuilder_Preallocate() may be more efficient):
---------------------------
    // Format "key=value"
    PyObject *format_with_builder(PyObject *key, PyObject *value)
    {
        assert(PyUnicode_Check(key));
        assert(PyUnicode_Check(value));

        // Allocated on the stack
        PyUnicodeBuilder builder;
        PyUnicodeBuilder_Init(&builder);

        // Overallocation is more efficient if the final length is unknown
        PyUnicodeBuilder_EnableOverallocation(&builder);
        PyUnicodeBuilder_WriteStr(&builder, key);
        PyUnicodeBuilder_WriteChar(&builder, '=');

        // Disable overallocation before the last write
        PyUnicodeBuilder_DisableOverallocation(&builder);
        PyUnicodeBuilder_WriteStr(&builder, value);

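        PyObject *str = PyUnicodeBuilder_Finish(&builder);
        // ... use str ...
        return str;

    // Not reached in this example: error handling is omitted above.
    // Code checking the result of each write would jump here on failure.
    error:
        PyUnicodeBuilder_Clear(&builder);
        ...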
    }
---------------------------

Proposed API (11 functions, 1 type):
---------------------------
typedef struct { ... } PyUnicodeBuilder;

void PyUnicodeBuilder_Init(PyUnicodeBuilder *builder);

int PyUnicodeBuilder_Preallocate(PyUnicodeBuilder *builder,
    Py_ssize_t length, uint32_t maxchar);

void PyUnicodeBuilder_EnableOverallocation(PyUnicodeBuilder *builder);
void PyUnicodeBuilder_DisableOverallocation(PyUnicodeBuilder *builder);

int PyUnicodeBuilder_WriteChar(PyUnicodeBuilder *builder, uint32_t ch);
int PyUnicodeBuilder_WriteStr(PyUnicodeBuilder *builder, PyObject *str);
int PyUnicodeBuilder_WriteSubstr(PyUnicodeBuilder *builder,
    PyObject *str, Py_ssize_t start, Py_ssize_t end);

int PyUnicodeBuilder_WriteASCIIStr(PyUnicodeBuilder *builder,
    const char *str, Py_ssize_t len);
int PyUnicodeBuilder_WriteLatin1Str(PyUnicodeBuilder *builder,
    const char *str, Py_ssize_t len);

PyObject* PyUnicodeBuilder_Finish(PyUnicodeBuilder *builder);
void PyUnicodeBuilder_Clear(PyUnicodeBuilder *builder);
---------------------------

The proposed API is based on the private _PyUnicodeWriter C API that I
added in Python 3.3 to optimize the PEP 393 implementation.

PyUnicodeBuilder_WriteASCIIStr() is an optimization: in release mode,
the function doesn't have to check if the string contains non-ASCII
characters. In debug mode, it must fail if the string contains a
non-ASCII character. If you consider that this API is too likely to
introduce bugs in release mode, it can be removed.
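
To make the debug/release difference concrete, here is a minimal
standalone sketch of such a check. The names is_ascii() and
check_ascii_input() are hypothetical and not part of the proposed API;
the sketch assumes that "debug mode" maps to NDEBUG not being defined,
which may differ from how CPython selects its debug build.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if every byte of str[0..len) is ASCII. */
static int is_ascii(const char *str, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)str[i] > 127) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical guard showing where the check would run.
   Debug build (NDEBUG undefined): return -1 on a non-ASCII byte.
   Release build: skip the scan and trust the caller. */
static int check_ascii_input(const char *str, size_t len)
{
#ifndef NDEBUG
    if (!is_ascii(str, len)) {
        return -1;
    }
#else
    (void)str;
    (void)len;
#endif
    return 0;
}
```

In a release build the scan disappears entirely, which is where the
speedup comes from, and also why a caller passing non-ASCII bytes would
go unnoticed.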

The maxchar argument of PyUnicodeBuilder_Preallocate() can be zero, but
for the current Python implementation (PEP 393 compact strings), it's
more efficient if maxchar matches the expected Unicode storage: 127 for
ASCII, 255 for Latin1, 0xffff for UCS2, or 0x10ffff for UCS4. The
value doesn't have to be exact: for example, it can be any value in
[128; 255] for Latin1. The problem is that computing maxchar (which
requires reading characters, decoding a byte string with a codec, etc.)
can be expensive. A PyUnicodeBuilder_Preallocate() implementation can
ignore maxchar depending on the chosen Unicode string storage.
PyUnicode_MAX_CHAR_VALUE(str) can be used to compute maxchar, but this
function is specific to the current CPython implementation.
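
As an illustration of why the value doesn't have to be exact, the
mapping from maxchar to a storage class can be sketched as below. Both
function names are hypothetical and only mirror the four ranges listed
above; this is not the CPython implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round maxchar up to the largest code point of
   its storage class (127, 255, 0xffff or 0x10ffff).
   Values above 0x10ffff are not valid code points: return 0. */
static uint32_t canonical_maxchar(uint32_t maxchar)
{
    if (maxchar <= 127) return 127;          /* ASCII */
    if (maxchar <= 255) return 255;          /* Latin1 */
    if (maxchar <= 0xffff) return 0xffff;    /* UCS2 */
    if (maxchar <= 0x10ffff) return 0x10ffff; /* UCS4 */
    return 0;
}

/* Hypothetical helper: bytes per character for that storage class.
   ASCII and Latin1 both use 1 byte per character. */
static int storage_width(uint32_t maxchar)
{
    if (maxchar <= 255) return 1;
    if (maxchar <= 0xffff) return 2;
    if (maxchar <= 0x10ffff) return 4;
    return 0;
}
```

With this mapping, 128, 200 and 255 all select the same Latin1 storage,
so a caller only needs an upper bound on the code points it will write.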

Maybe a second "preallocate" function without maxchar should be added
(more convenient, but less efficient). I don't know.

Rationale for adding a new public C API.

Currently, the Python C API is designed to allocate a Unicode string
on the heap with uninitialized characters, and then basically
gives direct access to these characters. Since Python 3.3, creating
a Unicode string in a C extension has become more complicated: the
caller must know in advance the optimal storage for the characters:
ASCII [U+0000; U+007F], Latin1 UCS1 [U+0000; U+00FF], BMP UCS2
[U+0000; U+FFFF], or the full Unicode character set UCS4
[U+0000; U+10FFFF]. When writing a codec decoder (like decoding
UTF-8), the maximum code point is not known in advance and so the
decoder may need to change the buffer format while decoding (start
with UCS1, switch to UCS2, maybe switch a second time to UCS4).
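
The buffer format change described above can be sketched with a toy
builder in plain C. ToyBuilder and the toy_*() functions are hypothetical
names for illustration only; this is neither _PyUnicodeWriter nor the
proposed PyUnicodeBuilder, and it always overallocates by doubling.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void *data;   /* character storage, `width` bytes per character */
    size_t len;   /* number of characters written */
    size_t cap;   /* number of characters allocated */
    int width;    /* 1 (UCS1), 2 (UCS2) or 4 (UCS4) */
} ToyBuilder;

static void toy_init(ToyBuilder *b)
{
    b->data = NULL; b->len = 0; b->cap = 0; b->width = 1;
}

static uint32_t toy_get(const ToyBuilder *b, size_t i)
{
    switch (b->width) {
    case 1: return ((const uint8_t *)b->data)[i];
    case 2: return ((const uint16_t *)b->data)[i];
    default: return ((const uint32_t *)b->data)[i];
    }
}

static void toy_set(ToyBuilder *b, size_t i, uint32_t ch)
{
    switch (b->width) {
    case 1: ((uint8_t *)b->data)[i] = (uint8_t)ch; break;
    case 2: ((uint16_t *)b->data)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)b->data)[i] = ch; break;
    }
}

/* Grow and/or widen the storage, copying characters one by one. */
static int toy_reserve(ToyBuilder *b, size_t cap, int width)
{
    if (cap <= b->cap && width == b->width) return 0;
    if (cap < b->cap) cap = b->cap;
    ToyBuilder wider = { malloc(cap * (size_t)width), b->len, cap, width };
    if (wider.data == NULL) return -1;
    for (size_t i = 0; i < b->len; i++) {
        toy_set(&wider, i, toy_get(b, i));
    }
    free(b->data);
    *b = wider;
    return 0;
}

/* Append one code point; return 0 on success, -1 on error. */
static int toy_write_char(ToyBuilder *b, uint32_t ch)
{
    if (ch > 0x10ffff) return -1;
    int width = ch <= 0xff ? 1 : (ch <= 0xffff ? 2 : 4);
    if (width < b->width) width = b->width;   /* never narrow */
    size_t cap = b->cap;
    if (b->len == cap) cap = cap ? cap * 2 : 16;
    if (toy_reserve(b, cap, width) < 0) return -1;
    toy_set(b, b->len++, ch);
    return 0;
}

static void toy_clear(ToyBuilder *b)
{
    free(b->data);
    toy_init(b);
}
```

Writing 'a', then U+20AC, then U+1F600 makes the toy buffer go through
the UCS1, UCS2 and UCS4 formats in turn. Each switch copies everything
written so far, which is the cost a good maxchar hint would avoid.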

The current C API has multiple drawbacks:

* It is designed to target the exact format "PEP 393 compact strings"
("kind + data").
* It is inefficient for PyPy which uses UTF-8 internally. So it would
also be inefficient if Python is modified to also use UTF-8
internally.
* It leaks too many implementation details.
* It creates an uninitialized string which might be exposed by mistake
to Python code, which can lead to bugs or even crashes.

I propose adding a new API to "build a string" which would be
efficient on CPython and PyPy. Later, it should help Python
experiment with a different storage for Unicode strings (with
different trade-offs, like UTF-8).

Discussions about possibly changing the Unicode storage in Python in
the future, especially about issues caused by the C API which prevent
that:

* https://discuss.python.org/t/un-deprecate-pyunicode-ready-for-future-unicode-improvement/15717
* https://github.com/python/cpython/pull/92705#issuecomment-1125869198

In 2016, I wrote an article about the private _PyUnicodeWriter and
_PyBytesWriter C APIs that I added to optimize Python:
https://vstinner.github.io/pybyteswriter.html

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.