Re: [pypy-dev] UTF8 string passing in cffi and PyPy internal string optimizations

Armin Rigo Sat, 21 Mar 2015 10:09:55 -0700

Hi,

On 18 March 2015 at 15:49, Amaury Forgeot d'Arc <amaur...@gmail.com> wrote:
> 2015-03-17 18:27 GMT+01:00 Eleytherios Stamatogiannakis <est...@gmail.com>:
>> Right now when PyPy receives a utf8 string (from a C function) it has to
>> do 2 copies:
>>
>> 1. convert the cdata string to a pypy byte string via ffi.string
>> 2. convert ffi.string to a unicode string
>>
>> When pypy sends a utf8 string it also does 2 copies:
>>
>> 1. convert pypy unicode string to utf8-encoded byte string
>> 2. copy the byte string into a cdata string.


The "easy" solution to reduce the number of copies is to have one
custom function that does both steps.  The more involved solution that
you suggest is, imho, breaking the way CFFI is supposed to work; see
below.

>> From what I understand, there is a cffi optimization dealing with Windows
>> unicode (via set_unicode) where, on Windows platforms and when using the
>> native Windows unicode strings, cffi avoids doing one of the copies in both
>> of the above cases.
>>
>> On linux where the default unicode format for C libraries nowadays is
>> UTF8, there is no such optimization, so we have to do the two copies in all
>> string passing.

I think you're misunderstanding set_unicode() (or else I'm
misunderstanding what you say).  It's just a way to declare some
Windows-specific unicode types, like TCHAR, to be either "char" or
"wchar_t".  It doesn't enable or disable any optimization.

>> PyPy at some point was going towards using utf8 strings internally, but I
>> don't know if this is still the plan or not.

PyPy might go there, at some point, but clearly not CPython.  We still
need a way to avoid the double copies there.

>> 1. If PyPy doesn't go towards using utf8 strings internally, maybe we need
>> some special C type that denotes that the string is utf8 and pypy/cffi
>> should do the conversion from-to it automatically. Something like "wchar_t"
>> in windows but denoting a utf8 string. CFFI can define a special type
>> ("__utf8char_t"?) for these strings.

What we really want is simply a variant of ffi.string() that accepts a
"char *" pointer, interprets it as utf-8, and returns a unicode
object; as well as another function that does the opposite.  If you're
interested in supporting the Windows case specially, then you want a
variant that would copy from/to a "TCHAR *" pointer on Windows.  This
is doable without any CFFI special types.
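
A minimal sketch of the interface described above, written in plain Python
on top of existing CFFI calls. The names utf8_to_unicode() and
unicode_to_utf8() are hypothetical and not part of CFFI. This version still
makes two copies in each direction; it only shows the shape such a pair of
functions could have, and a built-in variant could presumably fold both
steps into one.

```python
from cffi import FFI

ffi = FFI()

def utf8_to_unicode(cdata):
    """Hypothetical helper: read a NUL-terminated "char *" as UTF-8."""
    # Copy 1: ffi.string() copies the C bytes up to the first NUL.
    # Copy 2: decode() builds the unicode object from that byte string.
    return ffi.string(cdata).decode("utf-8")

def unicode_to_utf8(text):
    """Hypothetical helper: return a new NUL-terminated "char[]" in UTF-8."""
    # Copy 1: encode() builds a UTF-8 byte string.
    # Copy 2: ffi.new() copies it into freshly allocated C memory.
    return ffi.new("char[]", text.encode("utf-8"))

# Round trip through C memory.
c_buf = unicode_to_utf8(u"h\u00e9llo")
round_tripped = utf8_to_unicode(c_buf)
```

The cdata returned by unicode_to_utf8() owns its memory, so it has to be
kept alive for as long as the C side uses the pointer.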

> This is a first step towards SWIG's typemaps:
> http://www.swig.org/Doc3.0/Typemaps.html#Typemaps_nn4
>
> That's also something I wanted to have in another project: automatic
> conversion to PYTHON_HANDLE, for example.
>
> But typemaps are a tough thing, and they would likely differ between CPython
> and PyPy.
> Armin, what do you think?

I think that typemaps are not the right solution to this problem :-)


A bientôt,

Armin.