[Python-Dev] Re: Enhancement request for PyUnicode proxies

2020-12-26 Thread Ronald Oussoren via Python-Dev

> On 25 Dec 2020, at 23:03, Nelson, Karl E. via Python-Dev wrote:
> 
> I was directed to post this request to the general Python development 
> community so hopefully this is on topic.
>  
> One of the weaknesses of the PyUnicode implementation is that the type is 
> concrete and there is no option for an abstract proxy string to a foreign 
> source.  This is an issue for an API like JPype in which java.lang.Strings 
> are passed back from Java.   Ideally these would be a type derived from the 
> Unicode type str, but that requires transferring the memory immediately from 
> Java to Python even when that handle is large and will never be accessed from 
> within Python.  For certain operations like XML parsing this can be 
> prohibitive, so instead of returning a str we return a JString.   (There is 
> a separate issue that Java method names and Python method names conflict so 
> direct inheritance creates some problems.)
>  
> The JString type can of course be transferred to Python space at any time as 
> both Python Unicode and Java string objects are immutable.  However the 
> CPython API which takes strings only accepts the Unicode type objects which 
> have a concrete implementation.  It is possible to extend strings, but those 
> extensions do not allow for proxying as far as I can tell.  Thus there is no 
> option currently to proxy to a string representation in another language.  
> The concept of using the duck-typed ``__str__`` method is insufficient as 
> this indicates that an object can become a string, rather than “this object is 
> effectively a string” for the purposes of the CPython API.
>  
> One way to address this is to use the currently outdated concept of READY to 
> extend Unicode objects to other languages.  A class like JString would be an 
> unready Unicode object which, when READY is called, transfers the memory from 
> Java, sets up the flags and sets up a pointer to the code point 
> representation.  Unfortunately the READY concept is scheduled for removal and 
> thus the chance to address the needs for proxying a Unicode object to another 
> language's representation may be limited. There may be other methods to 
> accomplish this without using the concept of READY.  So long as access to the 
> code points goes through the Unicode API and the Unicode object can be 
> extended such that the actual code points may be located outside of the 
> Unicode object, then a proxy can still be achieved if there are hooks in it to 
> decide when a transfer should be performed.   Generally the transfer request 
> only needs to happen once, but the key issue is that neither the number of 
> code points nor their kind will be known until the memory is transferred.
>  
> Java has much the same problem.   Although they defined an interface class 
> “java.lang.CharSequence” the actual “java.lang.String” class is concrete 
> and almost all API methods take a String rather than the base interface even 
> when the base interface would have been adequate.  Thus just like Python has 
> difficulty treating a foreign string class as it would a native one, Java 
> cannot treat a Python string as a native one either.  So Python strings get 
> represented as a CharSequence type, which greatly limits their use.
>  
> Summary:
>  
> A String proxy would need the address of the memory in the “wstr” slot though 
> the code points may be char[], wchar[] or int[] depending on the 
> representation in the proxy.
> API calls to interpret the data would need to check to see if the data is 
> transferred first; if not, they would call the proxy-dependent transfer 
> method, which is responsible for creating a block of code points and setting 
> up flags (kind, ascii, ready, and compact). 
> The memory block allocated would need to call the proxy-dependent destructor 
> to clean up when the string is done.
> It is not clear if this would have an impact on performance.   Python already 
> has the concept of a string which needs actions before it can be accessed, 
> but this is scheduled for removal.
>  
> Are there any plans currently to address the concept of a proxy string in 
> PyUnicode API?  

I have a similar problem in PyObjC which proxies Objective-C classes to Python 
(and the other way around). For interop with Python code I proxy Objective-C 
strings using a subclass of str() that is eagerly populated even if, as you 
mention as well, a lot of these proxy objects are never used in a context where 
the str() representation is important.  A complicating factor for me is that 
Objective-C strings are, in general, mutable which can lead to interesting 
behaviour. Another disadvantage of subclassing str() for foreign string 
types is that this removes the proxy class from its logical location in the 
class hierarchy (in my case the proxy type is not a subclass of the proxy type 
for NSObject, even though all Objective-C classes inherit from NSObject).
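
A small sketch of the trade-off described here, using hypothetical classes (`ForeignString`, `EagerProxy`, `StrOnly`) rather than PyObjC's real types: a `str` subclass has to copy its contents at creation time, but in return it is accepted wherever a `str` is required, while an object that only defines ``__str__`` is not.

```python
class ForeignString:
    """Hypothetical foreign-side string that counts how often it is copied."""

    def __init__(self, data):
        self.data = data
        self.copies = 0

    def copy_out(self):
        self.copies += 1
        return self.data


class EagerProxy(str):
    """Hypothetical str subclass, populated when the proxy is created."""

    def __new__(cls, foreign):
        # str is immutable, so the contents must be supplied right here.
        self = super().__new__(cls, foreign.copy_out())
        self.foreign = foreign
        return self


class StrOnly:
    """Object that can merely be converted to a string."""

    def __str__(self):
        return "name"


foreign = ForeignString("name")
eager = EagerProxy(foreign)
copies_at_creation = foreign.copies        # the copy has already happened
accepted = "name.txt".startswith(eager)    # a str subclass is accepted

try:
    "name.txt".startswith(StrOnly())       # __str__ alone is not enough
    rejected = False
except TypeError:
    rejected = True
```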

I primarily chose to subclass the str type because that enables

[Python-Dev] Re: Enhancement request for PyUnicode proxies

2020-12-26 Thread Phil Thompson via Python-Dev

On 26/12/2020 10:52, Ronald Oussoren via Python-Dev wrote:

I have a similar problem in PyObjC which proxies Objective-C classes
to Python (and the other way around). For interop with Python code I
proxy Objective-C strings using a subclass of str() that is eagerly
populated even if, as you mention as well, a lot of these proxy objects
are never used in a context where the str() representation is
important.  A complicating factor for me is that Objective-C strings
are, in general, mutable which can lead to interesting behaviour.
Another disadvantage of subclassing str() for foreign string types is
that this removes the proxy class from its logical location in the
class hierarchy (in my case the proxy type is not a subclass of the
proxy type for NSObject, even though all Objective-C classes inherit
from NSObject).

I primarily chose to subclass the str type because that enables using
the NSString proxy type with C functions/methods that expect a string
argumen

[Python-Dev] Re: Enhancement request for PyUnicode proxies

2020-12-26 Thread Guido van Rossum
On Sat, Dec 26, 2020 at 3:54 AM Phil Thompson via Python-Dev <
python-dev@python.org> wrote:

> It's worth comparing the situation with byte arrays. There is no problem
> of translating different representations of an element, but there is
> still the issue of who owns the memory. The Python buffer protocol
> usually solves this problem, so something similar for unicode "arrays"
> might suffice.
>

Exactly my thought on the matter. I have no doubt that between all of us we
could design a decent protocol.
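
For readers less familiar with the comparison, here is a minimal sketch of how the existing buffer protocol handles sharing and ownership for bytes, using only built-in types. It shows current behaviour for `bytearray` and `memoryview`; what an equivalent protocol for str would look like is left open in this thread.

```python
data = bytearray(b"hello world")
view = memoryview(data)          # shares data's memory; nothing is copied
view[6:11] = b"WORLD"            # a write through the view...
shared = bytes(data)             # ...is visible in the owning object

try:
    data.extend(b"!")            # resizing is refused while a view exists
    resize_blocked = False
except BufferError:
    resize_blocked = True

view.release()                   # the consumer is done with the memory
data.extend(b"!")                # now the owner may resize again

try:
    memoryview("hello world")    # str does not export a buffer
    str_has_buffer = True
except TypeError:
    str_has_buffer = False
```

The owner keeps the memory alive and stable for as long as a consumer holds a view, which is the part of the problem a string protocol would also have to solve.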

The practical problem would be to convince enough people that this is worth
doing to actually get the code changed (str being one of the most popular
data types traveling across C API boundaries), in the CPython core (which
surely has a lot of places to modify) as well as in the vast collection of
affected 3rd party modules. Like many migrations it's an endless slog for
the developers involved, and in open source it's hard to assign resources
for such a project.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/2FO5LQIO7UV4HKLROHUTPFKCBT2MH6DJ/
Code of Conduct: http://python.org/psf/codeofconduct/