On Dec 11, 2009, at 6:02 PM, Greg Ewing wrote:
> I've had an idea that might help with making the
> encoding and decoding of unicode strings more
> automatic.
>
> Suppose we have a way of expressing a type parameterised
> with an encoding, maybe something like
>
>     encoding[name]
>
> We could have a few predefined ones, such as
>
>     ctypedef encoding['ascii'] ascii
>     ctypedef encoding['utf8'] utf8
>     ctypedef encoding['latin1'] latin1
>
> These are Python object types. Internally they're
> represented as bytes objects, but the compiler knows
> statically that they have an encoding associated with
> them, and the appropriate encoding and decoding
> operations are performed when coercing from and to
> strings.
>
> Being bytes, they can also be cast to char * without
> any problem. So we can write things like
>
>     cdef extern from "foo.h":
>         void cflump(char *)
>
>     def flump(utf8 s):
>         cflump(s)
>
> Now we can pass a unicode string to flump() and it will
> first be encoded to bytes as utf8, and then passed to
> cflump() as a char *.
So if I'm understanding correctly here, utf8 would behave like a bytes
object except that one could assign unicode objects to it? Would

    def flump(utf8 s):
        return s

return a bytes object?
I think I've mentioned this before, but I find conversion/construction
happening on object/object boundaries a bit less intuitive, a
bit like "def flump(tuple t)" accepting a list and creating a tuple
behind the scenes. The final goal is not to get a bytes object, but a
char*, so it seems more natural to put the conversion at that spot.
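To make the semantics concrete, here is a rough plain-Python sketch of
what Greg's `def flump(utf8 s)` would do under the hood (`cflump_stub`
is a hypothetical stand-in for the C function; the real coercion would
of course happen in generated C):

```python
def cflump_stub(data):
    # Hypothetical stand-in for the C cflump(char *); only ever sees bytes.
    assert isinstance(data, bytes)
    return data

def flump(s):
    # A 'utf8'-typed argument would accept bytes as-is and encode unicode
    # to UTF-8 before the value crosses the char* boundary.
    if not isinstance(s, bytes):
        s = s.encode('utf-8')
    return cflump_stub(s)
```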
> For going the other way, we also need a corresponding
> family of C string types with associated encodings. We
> could give them different names, but that isn't really
> necessary, since we can re-use the same ones:
>
>     cdef extern from "foo.h":
>         utf8 *cbrazzle()
>
> This is unambiguous, because you can't declare a pointer
> to a Python object. What we're saying here is that
> cbrazzle() returns a char *, but it is to be understood as
> encoded in utf8. So we can write
>
>     def brazzle():
>         return <str>cbrazzle()
>
> and the return value from cbrazzle is automatically
> decoded using utf8.
>
> I've put a cast there because otherwise there would be an
> ambiguity -- should a utf8 * be converted to a str on
> coercion to a Python type, or a utf8 (i.e. bytes) object?
>
> Having to use a cast is a bit ugly, though. It could be
> eliminated by allowing a def function to specify a return
> type:
>
>     def str brazzle():
>         return cbrazzle()
>
> Or there could simply be a rule that resolves the
> ambiguity in favour of str whenever the target type is
> a generic Python object, in which case we could simply
> write
>
>     def brazzle():
>         return cbrazzle()
Actually, Cython already has a c_utf8_char_array_type that I think is
supposed to do this, though I don't think it's actually used anywhere.
There is kind of an odd asymmetry here: for instance, if I had a
function that both accepted and returned a char*, I would have to write
    cdef extern from "foo.h":
        utf8* cblarg(char*)

    [somewhere much later]

    def blarg(utf8 s):
        return cblarg(s)
Another disadvantage of attaching the encoding to the C signature is
that for many declarations, especially ones that could be widely
shared (printf, fread, ...) or eventually auto-generated (from a C
header file), it doesn't make as much sense to attach an encoding to
the C function so much as to the module/function in which it's used.
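For comparison, the round trip that `blarg` above implies can be
sketched in plain Python (`cblarg_stub` is a hypothetical stand-in for
the C function, operating on raw bytes):

```python
def cblarg_stub(data):
    # Hypothetical stand-in for cblarg(char*): works on raw utf8 bytes.
    return data.upper()

def blarg(s):
    # 'utf8' on both sides of the C call: encode the argument on the
    # way in, decode the char* result on the way out.
    raw = s if isinstance(s, bytes) else s.encode('utf-8')
    return cblarg_stub(raw).decode('utf-8')
```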
> What do you think? Seems like this sort of scheme would
> keep the encoding being used at each point fairly explicit
> without being too intrusive.
My whole goal was to not have to be explicit at each point, but to be
able to specify the encoding (or at least a default encoding) for an
entire file, function, or block of code at once rather than at every
line. You're right, this isn't very intrusive, but no matter how
unintrusive, there's still the matter of converting old code (e.g.
Sage) as well as yet-unwritten code (not necessarily by those
participating in this discussion, but any current or future Cython
users who aren't thinking about unicode or targeting Py3 yet), and
the fact that it's just one more thing to have to learn and
constantly keep in mind.
If I want to be explicit at every point, Stefan's optimized .encode()
and .decode() plus a cython.str() special method seem natural enough,
and also have the advantage that they're what you'd write in Python
anyway, so there are no new keywords and no special "cython" way of
doing things. The only glaring deficiency is the inverse of
cython.str, which would create a char* from a bytes or unicode object;
that would be especially convenient for function arguments (including
cpdef functions that would be able to accept a raw char*).
(As an aside, perhaps we could let str(...) take a char* directly,
plus an (optional or not) encoding, so one wouldn't even have to use
cython.str(...).)
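For what it's worth, once you have a bytes object in hand, Python 3's
builtin already works this way: str(b, encoding) decodes, so only the
char*-to-bytes step would need Cython's help. A sketch:

```python
# Stand-in for bytes obtained from a char* at the C boundary.
raw = "café".encode('utf-8')

# The builtin str() accepts bytes plus an encoding (Python 3).
decoded = str(raw, 'utf-8')
```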
- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev