> The more basic question is: if we can transparently support unicode in  
> char*, why not? Even for non-English speakers, the majority of  
> strings being passed around will be ASCII.
>   
Always defaulting to UTF-8 could be confusing in some contexts. For 
instance, suppose one has a Cython source file in latin1, calls a 
spelling correction library that works exclusively in latin1 (I've 
worked with such a library once...), and in general never touches UTF-8 
anywhere; it would then be surprising that UTF-8 is passed to the library.
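To make the mismatch concrete, here is a small plain-Python sketch (not 
Cython, and not the spelling library itself). The sample string is only 
an illustration; the point is that the same text becomes different bytes 
under the two encodings, so a latin1-only consumer misreads UTF-8 input.

```python
# Illustrative text only; any string with non-ASCII characters would do.
text = "my æøåÅ"

# The same text as raw bytes under the two encodings discussed above.
utf8_bytes = text.encode("utf-8")      # two bytes per accented letter
latin1_bytes = text.encode("latin-1")  # one byte per accented letter

# What a latin1-only library would effectively "see" if it were handed
# the UTF-8 bytes: each accented letter turns into two unrelated characters.
misread = utf8_bytes.decode("latin-1")

print(len(latin1_bytes), len(utf8_bytes))
print(repr(misread))
```

The byte counts differ (7 versus 11 here), and the misread text no longer 
matches the original.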

All in all it seems to be the lesser evil, though. (In particular, I 
like defaulting to UTF-8 a lot better than having the encoding of the 
Cython source matter, which is where Stefan would disagree, if I 
understand correctly.)

> I think both (a) and (b) are non-negligible issues, especially in the  
> context of wrapping existing C libraries. Having to learn a new type  
> like utf8charbuf (which also masks its pointer nature; is its memory  
> managed?) isn't desirable, especially if one is casting  
> everywhere back between any object and char*. It also creates the  
> expectation that all different kinds of encodings need to be  
> supported with their own special type, and I don't think we want  
> anything as heavy as a class.
>   
OK, I've polished it to deal with some of these. Your main points are 
still valid though so I'll consider it dismissed...

It wouldn't be beyond the Cython compiler to do something like

cdef uchar("utf-8")* buf = "my æøåÅ"

which would be translated directly to

cdef char* buf = "my \some\escape\sequence"

and have

cdef uchar("utf-8")* buf = pyobj

become

cdef char* buf = unicode(pyobj).encode("utf-8")

It wouldn't be complicated to support many encodings; the encoding name 
would just be passed on to CPython. No heavy class involved.
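As a rough illustration of what the two translations above would 
compute, here is a plain-Python sketch. The helper name 
coerce_to_encoded_bytes is hypothetical, and str() stands in for 
unicode() on Python 3. It only shows the resulting bytes, not how Cython 
would actually generate the code.

```python
def coerce_to_encoded_bytes(pyobj, encoding="utf-8"):
    """Hypothetical stand-in for unicode(pyobj).encode(encoding).

    On Python 3, str() plays the role that unicode() plays above.
    """
    return str(pyobj).encode(encoding)


# The literal case: the bytes the compiler would have to emit as escapes.
as_utf8 = coerce_to_encoded_bytes("my æøåÅ")

# Another encoding only changes the name handed to the codec machinery.
as_latin1 = coerce_to_encoded_bytes("my æøåÅ", "latin-1")

# Non-string objects go through their text representation first.
from_int = coerce_to_encoded_bytes(42)

print(as_utf8)
print(as_latin1)
print(from_int)
```

One caveat with the char* form: the pointer refers to memory owned by the 
temporary bytes object, so that object has to be kept alive for as long 
as the pointer is in use.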

-- 
Dag Sverre

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
