Re: [Cython] target language syntax of Cython: Py2.6 or Py3.0?

Robert Bradshaw Wed, 16 Apr 2008 11:49:26 -0700

On Apr 16, 2008, at 4:25 AM, Stefan Behnel wrote:
> Robert Bradshaw wrote:
>> On the other end of things, I would really like to output .c files
>> that can be compiled and linked into either 2.x or 3.x extensions
>> without having to re-run Cython (modulo, perhaps, new builtins).
>
> Even builtins that are known to be a builtin in *some* but not all
> versions of Python could be supported with some module load time
> checking code. If you use them in your code, you won't be able to load
> the module into the interpreter if the builtin is not available in the
> running version. That's just like Python handles it.


Good idea. Actually, with our cached builtins, this might already
happen (i.e. at load time it does a lookup on all the builtin names
it uses).

>> Using PEP 263 to determine the encoding of string literals seems the
>> right thing to do. I don't want to lose the ability to do cdef char*
>> s = "test" (stored as an ASCII string)
>
> although the exact byte sequence in the C file would depend on the
> source encoding of the Cython file.

I think our C files should always be pure ASCII.
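
One way to get there, as a sketch only: escape every byte outside printable ASCII when writing a string literal into the .c file. The function c_string_literal below is a hypothetical illustration in Python 3, not the code Cython uses.

```python
def c_string_literal(data):
    """Render bytes as a C string literal made of ASCII characters only."""
    parts = []
    for byte in data:
        ch = chr(byte)
        if ch in ('"', '\\'):
            # Quote and backslash need a backslash escape in C.
            parts.append('\\' + ch)
        elif 32 <= byte < 127:
            parts.append(ch)
        else:
            # Three-digit octal escapes cannot absorb a following digit.
            parts.append('\\%03o' % byte)
    return '"' + ''.join(parts) + '"'
```

Whatever the source encoding of the Cython file, the bytes of the literal then appear in the C file as escapes, so the C file itself stays ASCII.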

>> Treating "xxx" as a char*
>> if it is pure ASCII, and as a unicode object otherwise, seems like
>> the obvious thing to do.
>
> That's what I meant with "too much magic". Cython shouldn't
> distinguish between the two based on the *content*. The distinction
> should be explicit in the source and Cython should raise an error if it
> doesn't work out. Above all, this means: no automatic recoding behind
> the scenes.

In light of my proposal to use UTF-8 everywhere, this could actually  
be turned into a char*.

> That's the main reason why Py3 has a well defined "bytes" type and a
> Unicode "str" type instead of a Unicode "unicode" type and an
> underdefined "str" type in Py2.
>
>> What hasn't been resolved is conversions
>>
>>      cdef object o = s # s is a char*
>
> Sure, the semantics are clear: char* is a byte sequence in C, so the
> result is the equivalent of a byte sequence in Python: a byte string,
> i.e. a str object in Python2 and a bytes object in Py3.

I understand this distinction. Technically a char* is a byte string.  
The problem is that people are going to want to implicitly handle  
unicode <-> char* much more often.

> If you want a unicode string, use
>
>     cdef object o = (<object>s).decode('UTF-8')
>
> or whatever, maybe even the C-API Unicode decoding functions. But make
> sure the encoding you use is explicit.
>
>
>>      cdef char* s = o # o is a python unicode object (or,
>> equivalently, the result of str(o))
>
> That's not equivalent in Python 2, but it is in Py3.
>
>
>> Should this raise a compile time error?
>
> If the compiler knows that o *really* is of type "unicode", it can
> raise an error here. Otherwise, you'd get a runtime error from Python's
> string conversion functions.
>
>
>> (That would break a lot of
>> code...including really nice code like declaring a function argument
>> to be char*)
>
> That would still accept any kind of byte string or a bytes object in
> Py3, which is just fine IMHO.

I think this significantly impacts usability. For example, if I have  
a function

     def foo(char* x):
         ...

then users of my module won't be able to write foo("eggs") anymore,  
they will have to write foo(b"eggs") or even foo(x.encode('UTF-8'))  
if x is given to them from elsewhere. I don't think the user wants to  
bother with that.

Likewise, if I have

     def foo():
         cdef char* s
         ...
         return s

Then the user won't be able to write

     print "The answer is %s" % foo()

or

     foo() + "eggs"

You could say, well, do the conversion manually in the Cython file.
But one of the huge benefits of Cython is that it handles C <->
Python conversions naturally for you. char* might technically be a
bytes object, but conceptually it's equivalent to the default Python
string type (which happens to be unicode in Python 3000).

What is the disadvantage of simply using UTF-8 as the default
encoding for conversion to and from char* objects? (I am assuming
bytes(s) will be taken care of directly rather than attempting to
decode s (assumed to be a char*) into a unicode object first.)
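
To make the proposal concrete, here is a rough Python 3 sketch of the two coercions. The names to_char_ptr, from_char_ptr and DEFAULT_ENCODING are hypothetical, and bytes objects stand in for the C char* buffers.

```python
DEFAULT_ENCODING = "utf-8"  # assumption: the default proposed above


def to_char_ptr(obj):
    """Coerce a Python argument to the bytes a char* parameter would see."""
    if isinstance(obj, bytes):
        # Byte strings pass through untouched, with no recoding.
        return obj
    if isinstance(obj, str):
        return obj.encode(DEFAULT_ENCODING)
    raise TypeError("expected str or bytes, got %s" % type(obj).__name__)


def from_char_ptr(data):
    """Coerce the bytes behind a char* to the default Python string type."""
    return data.decode(DEFAULT_ENCODING)
```

Under these assumptions foo("eggs") and foo(b"eggs") hand the same bytes to the C side, and a returned char* can be used directly in foo() + "eggs".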

>> Whatever happens, I think <object><char*>o == o and <char*><object>s
>> == s are important.
>
> This will continue to work as we are dealing with plain byte strings
> here.
>
>
>> I like Dag's "lang: ..." proposal. [...]
>> I think the default language should be
>> determined by the runtime environment of the compiler, i.e. (which
>> can always be overridden, either globally or file-by-file, but
>> probably won't need to be most of the time).
>
> I actually prefer having it in the source file. Nothing keeps you from
> writing one source file in Py2 and another in Py3 and combining them
> into one module. :)

Yes, this should always be an option. But having it default to the  
target language of the compile-time environment lets the compiler  
transition when the user does.
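
As a sketch of that lookup order (all names here are hypothetical, nothing in Cython is being quoted): a per-file setting wins, then a global setting, and otherwise the major version of the Python running the compiler.

```python
import sys


def default_language_level(file_override=None, global_override=None,
                           version_info=sys.version_info):
    """Pick the source language level for one file."""
    if file_override is not None:
        # A declaration in the source file itself always wins.
        return file_override
    if global_override is not None:
        return global_override
    # Fall back to the interpreter the compiler is running under.
    return version_info[0]
```

A user who moves the compiler to a 3.x interpreter would then get Py3 syntax by default without touching any files, while files that declare their own level keep it.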

- Robert

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
