[issue35295] Please clarify whether PyUnicode_AsUTF8AndSize() or PyUnicode_AsUTF8String() is preferred

2021-02-05 Thread Marcin Kowalczyk


Marcin Kowalczyk  added the comment:

Thank you!

This means that I can continue to use PyUnicode_AsUTF8AndSize() without 
worries: 
https://github.com/google/riegeli/commit/17ab36bfdd6cc55f37cfbb729bd43c9cbff4cd22

--

___
Python tracker 
<https://bugs.python.org/issue35295>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35295] Please clarify whether PyUnicode_AsUTF8AndSize() or PyUnicode_AsUTF8String() is preferred

2018-11-22 Thread Marcin Kowalczyk


New submission from Marcin Kowalczyk:

The documentation does not say whether PyUnicode_AsUTF8AndSize() or 
PyUnicode_AsUTF8String() is preferred.

We assume that both are acceptable for the given caller, i.e. the caller only 
wants to access the sequence of UTF-8 code units (e.g. to call a C++ function 
which takes std::string_view or std::string as a parameter), and the caller 
either copies the UTF-8 code units immediately or is willing to own a temporary 
object that keeps the UTF-8 code units alive.
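
The two caller patterns described above can be sketched as follows. This is 
only an illustration, not the recommended way to call either function: it 
assumes CPython and uses ctypes.pythonapi to reach the C API from Python so 
that it runs without compiling an extension. The helper names 
utf8_copy_immediately() and utf8_owned_temporary() are made up for this sketch.

```python
import ctypes

# Assumption: CPython, where ctypes.pythonapi exposes the C API of the
# running interpreter. The helper names below are illustrative only.
api = ctypes.pythonapi

# const char *PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
api.PyUnicode_AsUTF8AndSize.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsUTF8AndSize.restype = ctypes.c_void_p  # keep the raw address

# PyObject *PyUnicode_AsUTF8String(PyObject *unicode), returns a new reference
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8String.restype = ctypes.py_object


def utf8_copy_immediately(text):
    """Borrow the buffer from PyUnicode_AsUTF8AndSize() and copy it at once."""
    size = ctypes.c_ssize_t()
    address = api.PyUnicode_AsUTF8AndSize(text, ctypes.byref(size))
    # The buffer belongs to `text`, so copy it while `text` is still alive.
    return ctypes.string_at(address, size.value)


def utf8_owned_temporary(text):
    """Own a temporary bytes object returned by PyUnicode_AsUTF8String()."""
    return api.PyUnicode_AsUTF8String(text)
```

In the first helper the lifetime of the UTF-8 code units is tied to the str 
object; in the second the caller holds a separate bytes object for as long as 
the code units are needed.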

Comments in unicodeobject.h about PyUnicode_AsUTF8AndSize() include a warning:

   *** This API is for interpreter INTERNAL USE ONLY and will likely
   *** be removed or changed in the future.
   *** If you need to access the Unicode object as UTF-8 bytes string,
   *** please use PyUnicode_AsUTF8String() instead.

The discrepancy between these comments and the documentation should be fixed. 
Either the documentation is correct and the comment is outdated, or the comment 
is correct and the documentation is lacking guidance.

It is not even clear which function is better technically:

- PyUnicode_AsUTF8String() always allocates the string. 
PyUnicode_AsUTF8AndSize() does not allocate the string if the unicode object is 
ASCII-only (this is common) or if PyUnicode_AsUTF8AndSize() was already called 
before.

- If conversion must be performed, then PyUnicode_AsUTF8String() makes a single 
allocation, while PyUnicode_AsUTF8AndSize() first calls 
PyUnicode_AsUTF8String() and then copies the string.

- If the string is converted multiple times, then PyUnicode_AsUTF8AndSize() 
caches the result, which is faster. If the string is converted only once, the 
cached copy still persists as long as the string persists, which wastes memory.
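
The allocation and caching behaviour listed above can be observed from Python. 
The sketch below is a rough check, assuming CPython and ctypes.pythonapi; the 
names utf8_address(), buffer_is_inside_object() and compare() are hypothetical 
helpers, and the "inside the object" test relies on CPython implementation 
details (id() being the object address, sys.getsizeof() covering the inline 
character data of a str).

```python
import ctypes
import sys

# Assumption: CPython; ctypes.pythonapi gives access to the interpreter's C API.
api = ctypes.pythonapi
api.PyUnicode_AsUTF8AndSize.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsUTF8AndSize.restype = ctypes.c_void_p  # raw buffer address
api.PyUnicode_AsUTF8String.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8String.restype = ctypes.py_object


def utf8_address(text):
    """Return the address of the buffer from PyUnicode_AsUTF8AndSize()."""
    size = ctypes.c_ssize_t()
    return api.PyUnicode_AsUTF8AndSize(text, ctypes.byref(size))


def buffer_is_inside_object(text):
    """True if the UTF-8 buffer lies within the str object's own memory."""
    start = id(text)  # CPython detail: id() is the object's address
    return start <= utf8_address(text) < start + sys.getsizeof(text)


def compare(text):
    """Collect a few observations about both functions for one string."""
    first = utf8_address(text)
    second = utf8_address(text)
    bytes_a = api.PyUnicode_AsUTF8String(text)
    bytes_b = api.PyUnicode_AsUTF8String(text)
    return {
        # Repeated calls hand back the same (inline or cached) buffer.
        "same_buffer_on_repeat": first == second,
        # For ASCII-only strings the buffer is the string's own storage.
        "inside_str_object": buffer_is_inside_object(text),
        # Each PyUnicode_AsUTF8String() call allocates a separate bytes object.
        "fresh_bytes_each_call": bytes_a is not bytes_b,
    }
```

Calling compare("hello world") and compare("na\u00efve caf\u00e9") side by 
side shows the difference between an ASCII-only string, whose buffer is part of 
the str object, and a non-ASCII string, whose UTF-8 form is a separately 
allocated cache.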

I see the following possible resolutions:

1a. Declare both functions equally acceptable. Remove comments claiming that 
PyUnicode_AsUTF8AndSize() should be avoided.

1b. 1a, and change the implementation of PyUnicode_AsUTF8AndSize() to avoid 
allocating the string twice if it needs to be materialized, so that 
PyUnicode_AsUTF8AndSize() is never significantly slower than 
PyUnicode_AsUTF8String().

2a. Declare PyUnicode_AsUTF8String() preferred. Indicate this in the 
documentation.

2b. 2a, and provide a public interface to check for and access UTF-8 code units 
without allocating a new string when this is possible. I think 
PyUnicode_READY() + PyUnicode_IS_ASCII() + PyUnicode_DATA() + 
PyUnicode_GET_LENGTH() would work, but they are not documented. The interface 
could possibly also check whether the string has a cached UTF-8 representation, 
without populating that cache. A combination of this check with 
PyUnicode_AsUTF8String() would then rarely or never be significantly slower than 
PyUnicode_AsUTF8AndSize().

--
assignee: docs@python
components: Documentation, Interpreter Core, Unicode
messages: 330249
nosy: Marcin Kowalczyk, docs@python, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: Please clarify whether PyUnicode_AsUTF8AndSize() or 
PyUnicode_AsUTF8String() is preferred
type: performance
versions: Python 3.8
