On Apr 16, 2008, at 4:25 AM, Stefan Behnel wrote:
> Robert Bradshaw wrote:
>> On the other end of things, I would really like to output .c files
>> that can be compiled and linked into either 2.x or 3.x extensions
>> without having to re-run Cython (modulo, perhaps, new builtins).
>
> Even builtins that are known to be a builtin in *some* but not all
> versions of Python could be supported with some module load time
> checking code. If you use them in your code, you won't be able to load
> the module into the interpreter if the builtin is not available in the
> running version. That's just how Python handles it.
Good idea. Actually, with our cached builtins, this might already
happen (i.e. at load time it does a lookup on all the builtin names
it uses).
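A rough sketch of such a load-time check, written in Python rather than as generated C. The helper name check_builtins and the choice of ImportError are assumptions made for illustration, not what Cython actually emits, and the `builtins` module name is the Python 3 spelling (it is `__builtin__` in 2.x):

```python
import builtins  # Python 3 name; the 2.x equivalent is __builtin__


def check_builtins(names):
    """Look up every builtin a module uses, once, at load time.

    Returns a cache mapping name -> object, or refuses to load if
    any of the names is missing in the running interpreter.
    """
    missing = [name for name in names if not hasattr(builtins, name)]
    if missing:
        # Hypothetical error type; the point is only that loading fails early.
        raise ImportError(
            "builtins not available in this interpreter: " + ", ".join(missing)
        )
    return {name: getattr(builtins, name) for name in names}


# Names a compiled module might have recorded as "used builtins".
cache = check_builtins(["len", "print"])
```

A module that uses a name missing from the running version would then fail at import rather than at the first call.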
>> Using PEP 263 to determine the encoding of string literals seems the
>> right thing to do. I don't want to lose the ability to do cdef char*
>> s = "test" (stored as an ASCII string)
>
> although the exact byte sequence in the C file would depend on the
> source encoding of the Cython file.
I think our C files should always be pure ASCII.
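One way to keep the generated C pure ASCII regardless of the source encoding is to write every non-ASCII byte of a literal as an octal escape. The following is only a sketch under that assumption; c_string_literal is a hypothetical helper, not an existing Cython function, and it does not handle C trigraphs:

```python
def c_string_literal(text, encoding="utf-8"):
    """Render text as a C string literal containing only ASCII characters."""
    parts = []
    for byte in bytearray(text.encode(encoding)):
        if byte in (0x22, 0x5C):       # '"' and '\' need a leading backslash
            parts.append("\\" + chr(byte))
        elif 0x20 <= byte < 0x7F:      # printable ASCII passes through as-is
            parts.append(chr(byte))
        else:                          # anything else becomes a 3-digit octal escape
            parts.append("\\%03o" % byte)
    return '"' + "".join(parts) + '"'


ascii_only = c_string_literal("test")
escaped = c_string_literal("\u00e9")   # e-acute, two bytes in UTF-8
```

Three-digit octal escapes are used because a C octal escape consumes at most three digits, so a digit following the escape cannot be absorbed into it.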
>> Treating "xxx" as a char*
>> if it is pure ASCII, and as a unicode object otherwise, seems like
>> the obvious thing to do.
>
> That's what I meant with "too much magic". Cython shouldn't distinguish
> between the two based on the *content*. The distinction should be
> explicit in the source and Cython should raise an error if it doesn't
> work out. Above all, this means: no automatic recoding behind the
> scenes.
In light of my proposal to use UTF-8 everywhere, this could actually
be turned into a char*.
> That's the main reason why Py3 has a well defined "bytes" type and a
> Unicode "str" type instead of a Unicode "unicode" type and an
> underdefined "str" type in Py2.
>
>> What hasn't been resolved is conversions
>>
>> cdef object o = s # s is a char*
>
> Sure, the semantics are clear: char* is a byte sequence in C, so the
> result is the equivalent of a byte sequence in Python: a byte string,
> i.e. a str object in Python 2 and a bytes object in Py3.
I understand this distinction. Technically a char* is a byte string.
The problem is that people are going to want to implicitly handle
unicode <-> char* much more often.
> If you want a unicode string, use
>
> cdef object o = (<object>s).decode('UTF-8')
>
> or whatever, maybe even the C-API Unicode decoding functions. But make
> sure the encoding you use is explicit.
>
>
>> cdef char* s = o # o is a python unicode object (or, equivalently,
>> # the result of str(o))
>
> That's not equivalent in Python 2, but it is in Py3.
>
>
>> Should this raise a compile time error?
>
> If the compiler knows that o *really* is of type "unicode", it can
> raise an error here. Otherwise, you'd get a runtime error from Python's
> string conversion functions.
>
>
>> (That would break a lot of
>> code...including really nice code like declaring a function argument
>> to be char*)
>
> That would still accept any kind of byte string or a bytes object in
> Py3, which is just fine IMHO.
I think this significantly impacts usability. For example, if I have
a function

    def foo(char* x):
        ...

then users of my module won't be able to write foo("eggs") anymore;
they will have to write foo(b"eggs") or even foo(x.encode('UTF-8'))
if x is given to them from elsewhere. I don't think the user wants to
bother with that.
Likewise, if I have

    def foo():
        cdef char* s
        ...
        return s

then the user won't be able to write

    print "The answer is %s" % foo()

or

    foo() + "eggs"
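To make the concern concrete, here is a plain-Python stand-in, assuming a char* return value surfaces as a bytes object under Python 3 semantics. The foo below is a hypothetical replacement for the Cython function above:

```python
def foo():
    # Stand-in for a Cython function returning a char*,
    # which would arrive in Python 3 as a bytes object.
    return b"42"


# Formatting does not fail, but it embeds the bytes repr
# instead of the text the caller presumably wanted.
formatted = "The answer is %s" % foo()

# Mixing bytes and text in a concatenation is rejected outright.
concat_error = None
try:
    foo() + "eggs"
except TypeError as exc:
    concat_error = exc

print(formatted)      # The answer is b'42'
print(concat_error is not None)
```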
You could say, well, do the conversion manually in the Cython file.
But one of the huge benefits of Cython is that it handles C <->
Python conversions naturally for you. char* might technically be a
bytes object, but conceptually it's equivalent to the default Python
string type (which happens to be unicode in Python 3000).
What is the disadvantage of simply using UTF-8 as the default
encoding for conversion to and from char* objects? (I am assuming
bytes(s) will be taken care of directly rather than attempting to
encode s (assumed to be a char*) into a unicode first).
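As a sketch of what that default could look like at the Python level: the helpers to_char_ptr and from_char_ptr are hypothetical names for the two conversion directions, with bytes standing in for char*; nothing here is existing Cython behaviour.

```python
DEFAULT_ENCODING = "utf-8"  # the proposed default for char* conversions


def to_char_ptr(obj, encoding=DEFAULT_ENCODING):
    """Python object -> byte string suitable for passing as a char*."""
    if isinstance(obj, bytes):
        return obj                   # already bytes: pass through untouched
    if isinstance(obj, str):
        return obj.encode(encoding)  # text: encode with the default
    raise TypeError("expected bytes or str, got %s" % type(obj).__name__)


def from_char_ptr(data, encoding=DEFAULT_ENCODING):
    """char* contents -> Python text, decoded with the default encoding."""
    return data.decode(encoding)


round_trip = from_char_ptr(to_char_ptr("gr\u00fc\u00df"))
```

With this, foo("eggs") and foo(b"eggs") would both be accepted. One case the sketch leaves open: a char* holding bytes that are not valid UTF-8 makes from_char_ptr raise UnicodeDecodeError, so some explicit route to plain bytes is still needed.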
>> Whatever happens, I think <object><char*>o == o and <char*><object>s
>> == s are important.
>
> This will continue to work as we are dealing with plain byte
> strings here.
>
>
>> I like Dag's "lang: ..." proposal. [...]
>> I think the default language should be
>> determined by the runtime environment of the compiler, i.e. (which
>> can always be overridden, either globally or file-by-file, but
>> probably won't need to be most of the time).
>
> I actually prefer having it in the source file. Nothing keeps you from
> writing one source file in Py2 and another in Py3 and combining them
> into one module. :)
Yes, this should always be an option. But having it default to the
target language of the compile-time environment lets the compiler
transition when the user does.
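The lookup order could be sketched as follows. The function name, its parameters, and the rule that a per-file setting beats a global one are all assumptions made for illustration; the thread does not fix any of them:

```python
import sys


def pick_language_level(file_override=None, global_override=None):
    """Choose the source language level (2 or 3) for one file.

    Assumed precedence: per-file setting, then global setting, then
    the major version of the Python running the compiler.
    """
    if file_override is not None:
        return file_override
    if global_override is not None:
        return global_override
    return sys.version_info[0]


default_level = pick_language_level()
```

Run under a 3.x interpreter with no overrides, this picks 3, so a project's default moves along with the interpreter its author uses.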
- Robert