New submission from Inada Naoki <songofaca...@gmail.com>:

Suppose you are writing an extension module that reads strings, for example 
an HTML escaper or a JSON encoder.

There are two approaches:

(a) Support three KINDs in the flexible unicode representation.
(b) Get UTF-8 data from the unicode.

(a) will be the fastest on CPython, but it has a few drawbacks:

 * It is tightly coupled to the CPython implementation.  It will be slow on 
PyPy.
 * CPython may change the internal representation to UTF-8 in the future, like 
PyPy.
 * You cannot easily reuse algorithms written in C that handle `char*`.
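
As a rough illustration of the last point, here is a minimal sketch of an 
HTML escaper that only needs a UTF-8 `char*` buffer and its length.  The name 
`html_escape_utf8` and its signature are hypothetical and not part of CPython; 
the point is only that such code never has to know about the three KINDs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a CPython API): HTML-escape a UTF-8 buffer.
 * Returns a malloc()ed, NUL-terminated string and stores its length in
 * *out_len.  Returns NULL if the allocation fails.  Overflow of len * 6
 * is not checked in this sketch. */
static char *
html_escape_utf8(const char *src, size_t len, size_t *out_len)
{
    /* Worst case: every input byte expands to a 6-byte entity. */
    char *dst = malloc(len * 6 + 1);
    if (dst == NULL) {
        return NULL;
    }
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        const char *rep = NULL;
        switch (src[i]) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&#x27;"; break;
        default:   break;
        }
        if (rep != NULL) {
            size_t rlen = strlen(rep);
            memcpy(dst + n, rep, rlen);
            n += rlen;
        }
        else {
            /* Bytes of multi-byte UTF-8 sequences are copied unchanged. */
            dst[n++] = src[i];
        }
    }
    dst[n] = '\0';
    *out_len = n;
    return dst;
}
```

All characters that need escaping are ASCII, and ASCII bytes never appear 
inside a multi-byte UTF-8 sequence, so a plain byte loop is enough.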

So I believe (b) should be the preferred way.
But CPython doesn't provide an efficient way to get UTF-8 from a unicode 
object.

 * PyUnicode_AsUTF8AndSize(): When the unicode contains a non-ASCII character, 
it creates a UTF-8 cache.  The cache is kept alive longer than required, and 
creating it costs an additional malloc + memcpy.

 * PyUnicode_AsUTF8String(): It creates a bytes object even when the unicode 
object is ASCII-only or already has a UTF-8 cache.

For speed and efficiency, I propose a new API:

```
  /* Borrow the UTF-8 C string from the unicode.
   *
   * Store a pointer to the UTF-8 encoding of the unicode in *utf8* and its
   * size in *len*.
   * The returned object is the owner of *utf8*.  You need to Py_DECREF() it
   * after you have finished using *utf8*.  The owner may not be the unicode
   * object itself.
   * Returns NULL if an error occurred while encoding the unicode.
   */
  PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8,
                                 Py_ssize_t *len);
```

When the unicode object is ASCII or already has a UTF-8 cache, this API 
increments the refcount of the unicode object and returns it.
Otherwise, it calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns the 
resulting bytes object.

----------
components: C API
messages: 358623
nosy: inada.naoki
priority: normal
severity: normal
status: open
title: No efficient API to get UTF-8 string from unicode object.
type: enhancement
versions: Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39087>
_______________________________________