Robert Bradshaw, 28.11.2009 22:12:
> My personal concern is the pain I see porting Sage to Py3. I'd have to
> go through the codebase and throw in encodes() and decodes() and
> change signatures of functions that take char* arguments
That's what I figured. Instead of having to fix up the code, you want a
do-what-I-mean str data type that unifies everything that's unicode, bytes
and char*, and that magically handles it all for you.
In that case, you should drop the Pyrex compatibility argument for now,
because I don't think you can have a Cython-specific hyper-versatile data
type with automatic memory management and all that while staying
compatible with the simple str/bytes type in Pyrex - even if we manage to
get it working without new syntax. We'd clearly break a lot of existing
Pyrex/Cython code by starting to coerce char* to unicode, for example.
> (which, I just realized, will be a step backwards for cpdef functions).
True. For cpdef functions, a char* parameter would be well-defined as long
as user code doesn't use different encodings for char* internally (which is
somewhat unlikely).
Ok, let's think this through. There are two different scenarios: one deals
with function signatures (strings going in and out), the other with
conversion on assignments or casts. In total, there are three cases:
accepting bytes/str/unicode in a str/bytes/char* signature, coercing
str/unicode to char*, and coercing char* to bytes or unicode.
Function signatures have two sides to them that are not symmetric. On one
side, you want your string-accepting functions to be agnostic about the
type of string that comes in (although you may or may not want control
over memory usage if you use char* in the signature). On the other side,
you want some string to come back out, which you may want to be a Py2-str
(read: bytes) or a unicode string (maybe in Py2 and definitely in Py3).
Remember that if your code originally couldn't handle unicode, the
surrounding code likely can't handle it either, so you wouldn't want your
hyper-versatile type to always turn into unicode.
1) Passing unicode strings into a function that expects char* means that
some kind of encoding must happen and a new Python bytes object must be
created on the fly. The input object isn't a problem here as the caller
holds a reference to it anyway. The encoded object, however, must have a
lifetime. Looking at buffer arguments, I wouldn't mind if that was the
lifetime of the function call itself. After all, it's the user's choice to
use char* instead of str/bytes/unicode. So the case of a parameter typed as
char* is actually easy to handle from a memory POV, given that some kind of
automatic encoding is in place.
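A rough model of that call-scoped lifetime in plain Python (the
`encode_text_args` decorator and its default encoding are made up for
illustration, not proposed syntax): the encoded temporaries are local to
the wrapper frame, so they live exactly as long as the call itself, just
like the implicit bytes object behind a char* parameter would.

```python
import functools

def encode_text_args(encoding="utf-8"):
    """Encode any unicode argument to bytes for the duration of a call.

    The encoded temporaries are local to the wrapper, so they live
    exactly as long as the call itself - mirroring the proposed
    lifetime of the implicit bytes object behind a char* parameter.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            encoded = tuple(a.encode(encoding) if isinstance(a, str) else a
                            for a in args)
            return func(*encoded, **kwargs)
        return wrapper
    return decorator

@encode_text_args()
def c_like_strlen(s):
    # stands in for a C function taking char*; it only ever sees bytes
    assert isinstance(s, bytes)
    return len(s)
```

Here `c_like_strlen(u"abc")` and `c_like_strlen(b"abc")` both work, and
the encoded copy is collected as soon as the call returns.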
2) Automatic encoding for an assignment from unicode to char* is tricky,
because you can't easily make assumptions about the lifetime of the unicode
object itself. You could get away with a weak-ref mapping from unicode
strings to their byte encoded representation. I think every other attempt
to keep track of the lifetime of the unicode object is futile in current
Cython. Think of code like this, which I would expect to work:
    cdef unicode u = u"abcdefg"
    cdef char* s1 = u
    u2 = u
    cdef char* s2 = u2
    u = None
    print u, u2, s1, s2
So supporting automatic unicode->char* coercion on assignments is really
hard to do internally.
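The lifetime problem can be illustrated in plain Python (as a model only;
at the C level the issue is the buffer behind the pointer): every
coercion would have to encode, and each encoding yields a fresh bytes
object that nothing keeps alive. Note also that CPython cannot take weak
references to str/unicode instances, so the weak-ref mapping would need
support below the Python level.

```python
u = u"abcdefg"
b1 = u.encode("utf-8")
b2 = u.encode("utf-8")

# Each coercion yields a fresh, independent bytes object ...
assert b1 == b2 and b1 is not b2

# ... and `u` holds no reference to either of them, so once b1 is
# dropped, a char* pointing into its buffer would dangle, even though
# the unicode object is still alive.
del b1
assert u == u"abcdefg"

# Conversely, dropping the unicode object doesn't invalidate b2 - but
# Cython can't know that without tracking every such pair.
u = None
assert b2 == b"abcdefg"
```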
3) The third case is the same for both signatures and assignments:
automatic decoding of char* to unicode vs. instantiation of a bytes
object, i.e. the following should do The Right Thing:
    cdef char* some_c_string = ...
    some_python_name = some_c_string
This would be heavily simplified if some_python_name was typed as either
bytes or unicode (the latter of which might fail due to decoding errors),
and even str would work if it did different things in Py2 and Py3 (with
potential decoding errors only in Py3). However, that won't work for
untyped return values of def functions. Given that users would likely want
to use bytes in Py2 (for simple non-unicode strings) and unicode for other
strings in Py2 and all text strings in Py3, this isn't easy to handle
automatically.
Now, the proposal was to enable this with a compiler directive, which would
basically provide a default encoding. If this directive was used, all
untyped coercions from char* to a Python object would use it. As Dag noted
already, this would interfere with type inference, as the resulting type
would still be char* in that case. The only exception is untyped function
return values.
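A minimal Python sketch of the directive semantics (the names
`coerce_c_string` and `c_string_encoding` are illustrative, not proposed
syntax): with no encoding configured, char* data stays a byte string;
with one configured, it is decoded, with the same potential for a
UnicodeDecodeError that an automatic coercion would have.

```python
def coerce_c_string(raw, c_string_encoding=None):
    """Model the proposed coercion of a char* value to a Python object.

    `raw` stands in for the char* contents.  With no encoding
    configured (no directive), the result is a plain byte string;
    with one configured, the bytes are decoded - which may raise a
    UnicodeDecodeError, just as an automatic char* -> unicode
    coercion could.
    """
    if c_string_encoding is None:
        return bytes(raw)
    return raw.decode(c_string_encoding)

assert coerce_c_string(b"abc") == b"abc"           # no directive: bytes
assert coerce_c_string(b"abc", "utf-8") == u"abc"  # directive set: text
```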
For typed coercions to str or unicode, I personally don't think that it's
too much typing to require "c_string.decode(enc)", which would work nicely
with type inference. However, that would, again, not yield the
do-what-I-mean result of returning a byte string in Py2. Arguably, that
might be considered an optimisation, but it could still fall under the DWIM
compiler directive, e.g. as a "return_bytes_in_py2" option.
Ok, to sum things up, it looks like a special kind of coercion at function
call boundaries would be quite easy to support, and would work nicely with
type inference enabled. It would also match the support that CPython's
C-API argument unpacking functions have for converting Python strings.
Everything else would mean hard work inside of Cython and be rather hard to
explain to users.
BTW, I wouldn't mind extending the string input argument conversion support
to everything that supports the buffer protocol.
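In Python terms, that extension would look roughly like this (the
`as_bytes` helper is hypothetical): anything exporting a buffer can
provide the byte string a char* parameter needs, with unicode text going
through an explicit encode.

```python
import array

def as_bytes(obj, encoding="utf-8"):
    """Accept text, bytes, or any buffer-exporting object as a byte string.

    Unicode text goes through an explicit encode; everything else goes
    through memoryview, which accepts any object supporting the buffer
    protocol.
    """
    if isinstance(obj, str):
        return obj.encode(encoding)
    return memoryview(obj).tobytes()

assert as_bytes(u"abc") == b"abc"
assert as_bytes(b"abc") == b"abc"
assert as_bytes(bytearray(b"abc")) == b"abc"
assert as_bytes(array.array("b", [97, 98, 99])) == b"abc"
```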
Comments?
Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev