I've had an idea that might help make the
encoding and decoding of Unicode strings more
automatic.

Suppose we have a way of expressing a type parameterised
with an encoding, maybe something like

    encoding[name]

We could have a few predefined ones, such as

    ctypedef encoding['ascii'] ascii
    ctypedef encoding['utf8'] utf8
    ctypedef encoding['latin1'] latin1

These are Python object types. Internally they're
represented as bytes objects, but the compiler knows
statically that they have an encoding associated with
them, and the appropriate encoding and decoding
operations are performed when coercing from and to
strings.
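
As a rough illustration of the intended semantics, here is a pure-Python model of such a type. It is only a sketch: EncodedBytes, coerce() and to_str() are hypothetical names invented for this example, and in the actual proposal the compiler would insert the conversions statically rather than doing them at run time.

```python
class EncodedBytes(bytes):
    """Hypothetical stand-in for encoding[name]: bytes plus a known codec."""

    encoding = None  # filled in by the subclasses built below

    @classmethod
    def coerce(cls, value):
        # Coercing from str encodes with the type's codec;
        # bytes are taken as already encoded.
        if isinstance(value, str):
            return cls(value.encode(cls.encoding))
        return cls(value)

    def to_str(self):
        # Coercing back to str decodes with the same codec.
        return self.decode(self.encoding)


def encoding(name):
    """Build a type parameterised by codec name, mimicking encoding[name]."""
    return type(name, (EncodedBytes,), {"encoding": name})


# Counterparts of the predefined ctypedefs.
ascii_ = encoding("ascii")  # trailing underscore avoids shadowing the builtin
utf8 = encoding("utf8")
latin1 = encoding("latin1")

s = utf8.coerce("caf\u00e9")
print(bytes(s))    # the UTF-8 encoded form
print(s.to_str())  # round-trips to the original text
```

The same text coerced through latin1 would produce different bytes, which is why the encoding has to be part of the type.
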
Being bytes, they can also be cast to char * without
any problem. So we can write things like

    cdef extern from "foo.h":
        void cflump(char *)

    def flump(utf8 s):
        cflump(s)

Now we can pass a Unicode string to flump(), and it will
first be encoded to bytes as utf8 and then passed to
cflump() as a char *.
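
Spelled out by hand, the conversion that the utf8 argument type would perform looks roughly like the plain-Python sketch below. The cflump() here is only a stand-in for the C function; it returns the byte count so the effect is visible, whereas the real one is declared void.

```python
def cflump(raw):
    # Stand-in for the C function taking a char *.
    # Assumption: returns the byte length purely for demonstration.
    if not isinstance(raw, bytes):
        raise TypeError("cflump() needs bytes, as a char * would")
    return len(raw)


def flump(s):
    # What "def flump(utf8 s)" would do implicitly:
    # encode str arguments as UTF-8 before handing them to C.
    if isinstance(s, str):
        s = s.encode("utf8")
    return cflump(s)


print(flump("caf\u00e9"))  # 5: the accented letter takes two bytes in UTF-8
```
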
For going the other way, we also need a corresponding
family of C string types with associated encodings. We
could give them different names, but that isn't really
necessary, since we can re-use the same ones:

    cdef extern from "foo.h":
        utf8 *cbrazzle()

This is unambiguous, because you can't declare a pointer
to a Python object. What we're saying here is that
cbrazzle() returns a char *, but it is to be understood as
encoded in utf8. So we can write

    def brazzle():
        return <str>cbrazzle()

and the return value from cbrazzle() is automatically
decoded using utf8.

I've put a cast there because otherwise there would be an
ambiguity -- should a utf8 * be converted to a str on
coercion to a Python type, or a utf8 (i.e. bytes) object?
Having to use a cast is a bit ugly, though. It could be
eliminated by allowing a def function to specify a return
type:

    def str brazzle():
        return cbrazzle()

Or there could simply be a rule that resolves the
ambiguity in favour of str whenever the target type is
a generic Python object, in which case we could simply
write

    def brazzle():
        return cbrazzle()

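
The resolution rule could be modelled along these lines in plain Python. This is a sketch only: coerce_to_python() and its target parameter are hypothetical, standing in for a decision the compiler would make statically from the declared target type.

```python
def coerce_to_python(raw, codec, target=object):
    """Model coercing a `utf8 *` result (raw bytes) to a Python type.

    target=bytes  -> keep the encoded form (the utf8, i.e. bytes, object)
    target=str    -> decode, as the explicit <str> cast does
    target=object -> decode as well: the proposed default rule
    """
    if target is bytes:
        return raw
    if target is str or target is object:
        return raw.decode(codec)
    raise TypeError("cannot coerce an encoded C string to %r" % (target,))


raw = b"na\xc3\xafve"  # stand-in for what cbrazzle() returns

print(coerce_to_python(raw, "utf8"))         # generic object: decoded str
print(coerce_to_python(raw, "utf8", bytes))  # explicitly asked for bytes
```
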
What do you think? Seems like this sort of scheme would
keep the encoding being used at each point fairly explicit
without being too intrusive.
--
Greg