> "In the face of ambiguity, refuse the temptation to guess." :)
>
> Somehow "inferring" the difference between str and unicode literals is the
> wrong thing to do.
>   
I don't think I explained my question well enough; I'll try again.

The thing is, this kind of inferring already happens; you can do

cdef char c = "c"

and the string literal "c" becomes a single character value, while you 
can do

cdef char* s = "hello"

and you get a C string literal (which is passed through straight from 
Cython source), while

py_s = "hello"

gives a Python object. Somehow the "natural" thing to do for Py3 is to 
continue allowing "direct" assignments to char* of the type above; but 
generate unicode objects on coercion to Python object. (Hmm. So the 
problem is that one can no longer auto-coerce from Python string objects 
to char*...)
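
To make the three cases concrete outside Cython, here is a rough analogue 
using ctypes under Python 3. It is only an analogy and says nothing about 
how Cython implements these assignments; it does illustrate the last point, 
that a text string is not accepted as char* without naming an encoding. 
The "ascii" codec below is an arbitrary choice for the example.

```python
import ctypes

c = ctypes.c_char(b"c")          # single character value
s = ctypes.c_char_p(b"hello")    # C string (char*)
py_s = "hello"                   # ordinary Python (unicode) object

# Under Python 3, ctypes rejects a str where a char* is expected...
try:
    ctypes.c_char_p(py_s)
    coerced = True
except TypeError:
    coerced = False

# ...so the encoding has to be stated explicitly.
s2 = ctypes.c_char_p(py_s.encode("ascii"))
```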

Hmm. This might come from a wrong understanding of the problem, but from 
my limited knowledge, it looks like we get this problem because the 
current Cython behaviour is wrong, even in a Python 2.6 context. 
Suggestion:

- Support PEP 263 as you say. This is for *input* from Cython source 
*only*; the whole point is that whether you edit your source files on a 
UTF-8 or BIG-5 system shouldn't impact anything about runtime behaviour 
as long as you declare the encoding of the source file.
- Have a separate mechanism for specifying what encoding should be used 
for conversion to C buffers. One solution is command-line options; 
however this is also a candidate for a Cython language extension, as 
the "right" answer really depends on what encoding the C library you are 
calling is using! (char* is basically "encoding-less" in itself). One 
might even hard-code it to ASCII or latin1 for now.
- String literals to buffers (cdef char* s = "hello") are reencoded in 
Cython compilation to the right target encoding, so that if latin1 is 
specified for the C library in question I can get correct results 
editing the Cython source in UTF-8. In fact, for maximum portability of 
C source, one can use the literal if only ASCII is used, and otherwise 
generate something like

char s[] = {-20, 54, 50, 0};

If there's a mismatch between input and output encoding (I declared the 
C library I'm calling as ASCII but try to use my native "øåæÅØ") then 
it's a compile-time error.

- On coercions from Python strings (unicode or whatever) to char*, the 
same reencoding is used (call s.encode(ENCODING) or similar). This will 
raise the appropriate exceptions.
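
The literal re-encoding step (third point above) could be sketched roughly 
as follows, in plain Python rather than inside the compiler. The function 
name c_initializer is made up for illustration and nothing by that name 
exists in Cython; emitting negative values assumes a signed char on the 
target platform.

```python
def c_initializer(text, target_encoding):
    """Return C source text for a string literal re-encoded to target_encoding.

    Raises UnicodeEncodeError when the text cannot be represented in the
    target encoding; a compiler would report that as a compile-time error.
    """
    data = text.encode(target_encoding)
    if all(0x20 <= b < 0x7F and b not in (0x22, 0x5C) for b in data):
        # Printable ASCII with no quote or backslash: pass the literal through.
        return '"%s"' % data.decode("ascii")
    # Otherwise emit numeric byte values (as signed char) plus the NUL byte.
    signed = [b - 256 if b > 127 else b for b in data]
    return "{%s}" % ", ".join(str(v) for v in signed + [0])


print(c_initializer("hello", "latin1"))     # "hello"
print(c_initializer("\u00f8", "latin1"))    # {-8, 0}
```

Asking for c_initializer("\u00f8", "ascii") raises UnicodeEncodeError, 
which is the mismatch case described above.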

It would be good to solve this anyway, and I fail to see the connection 
with Python 3. I definitely don't think that Cython behaviour needs to 
differ between the two (even if everything is unicode in Python 3, there 
should be functionality somewhere in the library to generate byte data 
in other encodings?)
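
For what it's worth, Python 3 does have this: str.encode returns a bytes 
object in whatever codec is named. A small sketch, with arbitrary sample 
text:

```python
text = "\u00f8\u00e5\u00e6"        # the characters ø, å, æ

utf8 = text.encode("utf-8")        # two bytes per character for this text
latin1 = text.encode("latin-1")    # one byte per character

# A codec that cannot represent the text raises instead of guessing.
try:
    text.encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```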

Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev