On 26/12/2020 10:52, Ronald Oussoren via Python-Dev wrote:
On 25 Dec 2020, at 23:03, Nelson, Karl E. via Python-Dev
<python-dev@python.org> wrote:
I was directed to post this request to the general Python development
community so hopefully this is on topic.
One of the weaknesses of the PyUnicode implementation is that the type
is concrete and there is no option for an abstract proxy string to a
foreign source. This is an issue for an API like JPype in which
java.lang.Strings are passed back from Java. Ideally these would be
a type derived from the Unicode type str, but that requires
transferring the memory immediately from Java to Python even when that
handle is large and will never be accessed from within Python. For
certain operations like XML parsing this can be prohibitive, so
instead of returning a str we return a JString. (There is a separate
issue that Java method names and Python method names conflict so
direct inheritance creates some problems.)
The JString type can of course be transferred to Python space at any
time as both Python Unicode and Java string objects are immutable.
However the CPython API which takes strings only accepts the Unicode
type objects which have a concrete implementation. It is possible to
extend strings, but those extensions do not allow for proxying as far
as I can tell. Thus there is currently no option to proxy to a string
representation in another language. The duck-typed ``__str__`` method
is insufficient, as it indicates that an object can become a string
rather than “this object is effectively a string”
for the purposes of the CPython API.
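A minimal Python sketch of that distinction. `JStringProxy` is a hypothetical stand-in, not JPype's actual JString: it defines `__str__`, so it can be converted, yet a str method implemented in C still rejects it.

```python
class JStringProxy:
    """Hypothetical stand-in for a foreign string; not JPype's real JString."""

    def __init__(self, fetch):
        # fetch is a callable that copies the text out of the foreign runtime.
        self._fetch = fetch

    def __str__(self):
        # Explicit conversion works: the object *can become* a str.
        return self._fetch()


proxy = JStringProxy(lambda: "hello")

converted = str(proxy)  # explicit conversion copies the data

try:
    "hello world".startswith(proxy)  # the C-level method wants a real str
    accepted = True
except TypeError:
    accepted = False
```

Here `accepted` ends up False: having `__str__` is not enough for the object to be used where a str is expected.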
One way to address this is to use the currently outdated concept of
READY to extend Unicode objects to other languages. A class like
JString would be an unready Unicode object which, when READY is called,
transfers the memory from Java, sets up the flags and sets up a pointer
to the code point representation. Unfortunately the READY concept is
scheduled for removal, and thus the chance to address the need for
proxying a Unicode object to another language's representation may be
limited. There may be other ways to accomplish this without using the
concept of READY. So long as access to the code points goes through the
Unicode API, and the Unicode object can be extended such that the actual
code points may be located outside of the Unicode object, a proxy can
still be achieved if there are hooks in it to decide when a transfer
should be performed. Generally the transfer only needs to happen once,
but the key issue is that neither the number of code points nor their
kind will be known until the memory is
transferred.
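The transfer-on-first-access idea can be sketched in pure Python. This only illustrates the control flow, under the assumption that a callable copies the text from the foreign side; `LazyForeignString` and its `transfer` argument are hypothetical names, and nothing here touches the real PyUnicode internals.

```python
class LazyForeignString:
    """Hypothetical proxy: copies text from a foreign runtime on first use."""

    def __init__(self, transfer):
        self._transfer = transfer   # callable returning the text as a str
        self._value = None          # nothing copied yet
        self.transfers = 0          # counts how often the copy happened

    def _ready(self):
        # Similar in spirit to the READY step: materialise once, then reuse.
        if self._value is None:
            self._value = self._transfer()
            self.transfers += 1
            self._transfer = None   # drop the reference to the foreign side
        return self._value

    def __len__(self):
        # The length is unknown until the transfer has happened.
        return len(self._ready())

    def __getitem__(self, index):
        return self._ready()[index]

    def __str__(self):
        return self._ready()


s = LazyForeignString(lambda: "caf\u00e9")
before = s.transfers   # no copy has been made yet
length = len(s)        # first access triggers the transfer
first = s[0]           # later accesses reuse the cached copy
after = s.transfers
```

Because both Python and Java strings are immutable, caching the transferred copy is safe in this scenario.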
Java has much the same problem. Although it defines an interface
“java.lang.CharSequence”, the actual “java.lang.String” class
is concrete and almost all API methods take a String rather than the
base interface, even when the base interface would have been adequate.
Thus, just as Python has difficulty treating a foreign string class
as it would a native one, Java cannot treat a Python string as a native
one either. So Python strings get represented as a CharSequence type,
which greatly limits their use.
Summary:
A string proxy would need the address of the memory in the “wstr” slot,
though the code points may be char[], wchar[] or int[] depending on the
representation in the proxy.
API calls to interpret the data would first need to check whether the
data has been transferred; if not, they would call the proxy-dependent
transfer method, which is responsible for creating a block of code
points and setting up the flags (kind, ascii, ready, and compact).
The allocated memory block would need to call the proxy-dependent
destructor to clean up when the string is done.
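To illustrate why the kind and ascii flags cannot be set before the data is transferred, here is a small sketch that derives them from the code points, following the size classes of PEP 393 (1, 2 or 4 bytes per code point). The function `classify` is illustrative only and is not part of the CPython API.

```python
def classify(text):
    """Return (kind, is_ascii) for text, mirroring PEP 393's size classes.

    kind is the number of bytes needed per code point: 1, 2 or 4.
    """
    max_cp = max(map(ord, text), default=0)
    if max_cp < 0x100:
        kind = 1
    elif max_cp < 0x10000:
        kind = 2
    else:
        kind = 4
    return kind, max_cp < 0x80


ascii_only = classify("abc")
latin1 = classify("caf\u00e9")
bmp = classify("\u20ac")
astral = classify("\U0001F600")
```

Both values depend on the largest code point in the string, so a proxy would have to scan or transfer the foreign data before it could fill them in.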
It is not clear if this would have an impact on performance. Python
already has the concept of a string which needs actions before it can
be accessed, but this is scheduled for removal.
Are there any plans currently to address the concept of a proxy string
in the PyUnicode API?
I have a similar problem in PyObjC which proxies Objective-C classes
to Python (and the other way around). For interop with Python code I
proxy Objective-C strings using a subclass of str() that is eagerly
populated even if, as you mention as well, a lot of these proxy objects
are never used in a context where the str() representation is
important. A complicating factor for me is that Objective-C strings
are, in general, mutable which can lead to interesting behaviour.
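One way that behaviour can show up, sketched in Python. `EagerStrProxy` is a hypothetical class, not PyObjC's actual proxy type, and a plain list of characters stands in for a mutable foreign string.

```python
class EagerStrProxy(str):
    """Hypothetical str subclass populated eagerly from a mutable source."""

    def __new__(cls, source):
        # The text is copied at construction time; str itself is immutable.
        self = super().__new__(cls, "".join(source))
        self.source = source  # keep a handle on the mutable original
        return self


foreign = list("draft")          # stands in for a mutable foreign string
proxy = EagerStrProxy(foreign)
foreign.extend(" v2")            # the foreign side changes afterwards

snapshot = str(proxy)            # still the value copied at creation
current = "".join(proxy.source)  # what the foreign object holds now
```

After the mutation the str value of the proxy and the foreign object it wraps no longer agree.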
Another disadvantage of subclassing str() for foreign string types is
that this removes the proxy class from its logical location in the
class hierarchy (in my case the proxy type is not a subclass of the
proxy type for NSObject, even though all Objective-C classes inherit
from NSObject).
I primarily chose to subclass the str type because that enables using
the NSString proxy type with C functions/methods that expect a string
argument. That might be something that can be achieved using a new
protocol, similar to operator.index or os.fspath. A complicating
factor here is there’s a significant amount of Python code as well
that explicitly tests for the str type to exclude strings from code
paths that iterate over containers.
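A rough sketch of what such a protocol could look like, modelled on how os.fspath() behaves. Neither `as_str()` nor the `__string__` method exists in Python; both names, and `NSStringProxy`, are assumptions made for illustration. The `flatten` helper shows the kind of explicit str test mentioned above.

```python
def as_str(obj):
    """Hypothetical analogue of os.fspath() for string-like objects.

    Neither as_str() nor __string__ exists in Python; both are assumptions.
    """
    if isinstance(obj, str):
        return obj
    method = getattr(type(obj), "__string__", None)
    if method is None:
        raise TypeError(f"expected str-like object, not {type(obj).__name__}")
    result = method(obj)
    if not isinstance(result, str):
        raise TypeError("__string__ must return a str")
    return result


class NSStringProxy:
    """Hypothetical proxy that is not a str subclass; not PyObjC's class."""

    def __init__(self, text):
        self._text = text

    def __string__(self):
        return self._text

    def __iter__(self):
        return iter(self._text)


def flatten(items):
    """Typical container walk that special-cases str with isinstance()."""
    for item in items:
        if isinstance(item, str):
            yield item
        elif hasattr(item, "__iter__"):
            yield from flatten(item)
        else:
            yield item


converted = as_str(NSStringProxy("ab"))
flattened = list(flatten([NSStringProxy("ab"), "cd"]))
```

With this sketch, code that opts in to the protocol gets the text, while `flatten` treats the proxy as a container and splits it into single characters, which is the compatibility problem with existing isinstance(x, str) checks.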
Just to add another use case...
PyQt (the Python bindings for Qt) has a similar issue. Qt implements
unicode strings as a QString class which uses UTF-16 as the "native"
representation. Currently PyQt converts between Python unicode objects
and QString instances as and when required. While this might sound
inefficient I've never had a report saying that this was actually a
problem in a particular situation - but it would be nice to avoid it if
possible.
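For illustration only, the kind of copy involved can be shown with Python's own codecs. This is not PyQt's conversion code, just the equivalent round trip between a str and UTF-16 code units.

```python
text = "Qt \U0001F600"                   # includes one non-BMP code point

utf16 = text.encode("utf-16-le")         # copy into UTF-16 code units
code_units = len(utf16) // 2             # two bytes per code unit
round_trip = utf16.decode("utf-16-le")   # copy back into a Python str
```

The non-BMP character takes two UTF-16 code units, so the code-unit count differs from len(text), and each direction allocates a new copy of the data.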
It's worth comparing the situation with byte arrays. There is no problem
of translating different representations of an element, but there is
still the issue of who owns the memory. The Python buffer protocol
usually solves this problem, so something similar for unicode "arrays"
might suffice.
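As a reminder of how the buffer protocol handles ownership for bytes, here is a short example using the built-in memoryview, which exposes another object's memory without copying it.

```python
data = bytearray(b"hello world")

view = memoryview(data)      # shares the bytearray's memory, no copy
window = view[6:]            # slicing a memoryview is also copy-free
window[0] = ord("W")         # writes through to the owning bytearray

result = bytes(data)
view_is_owner = view.obj is data   # the view records who owns the memory

window.release()
view.release()
```

The view keeps a reference to the exporting object for as long as it is alive, which is the ownership question a Unicode equivalent would also have to answer.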
Phil
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/6SVDY4E7ASGTMVGPSBT2A7RZBVU53SZU/