Re: [Cython] String types with Python 2.x and 3.x

Robert Bradshaw Sat, 12 Sep 2009 10:30:23 -0700

On Sep 12, 2009, at 12:35 AM, Stefan Behnel wrote:

> Robert Bradshaw wrote:
>> If I compile the module against Py2, it should behave as if it
>> was a .py file under Py2, and if I compile the module under Py3, it
>> should behave as if it were a .py file under Py3. Moving code
>> from .py to .pyx should not change its behavior.
>
> Well, when you run a Py2 script in Py3, the semantics change. So it
> doesn't make sense to say "moving code from .py to .pyx should not
> change its behavior", as the same .py file can already have different
> behaviour.
>
> I'm fine with providing a separate front-end for compiling Python 3
> code ("cython3" ?), so I'm also fine with providing a separate
> front-end for compiling Python 2 code. Simply seeing the .py extension
> isn't enough anymore.
>
> I'm also fine with a command line option "-3"/"-2" that defines the
> semantics when compiling a .py file.


I think investigating something along these lines would be good.

> However, once the compilation is done,
> I think the semantics of literals should be fixed and should not
> change depending on the platform.

It already does out of necessity.

a = 10
b = 1000000000000000000000000000
type(a) == type(b) # depends on the environment
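
To make the point concrete, here is a small plain-Python sketch (the name same_type is only for illustration, it is not from the thread). Under Python 2 the two literals produce int and long, while under Python 3 both are int, so the result of the comparison depends on the interpreter the code runs under.

```python
import sys

a = 10
b = 1000000000000000000000000000

# Python 2: type(a) is int and type(b) is long, so this is False.
# Python 3: both are int, so this is True.
same_type = type(a) == type(b)
print(sys.version_info[0], same_type)
```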

>
>>>     isinstance("abc", unicode)
>>>
>>> return False in Py2 and True in Py3.
>>
>> This is an error in Py3.
>
> Correct, but neither in Python 2 nor in Cython, which currently uses
> the Py2 builtin names.
>
>
>> I don't see "abc" as a byte string, I see it as a string literal. If
>> it's used in a C context it's a byte string, and if used as a Python
>> object it's a Python str. This is how we handle all other literals
>> (e.g. large integer literals used as Python objects are not the same
>> as large integer literals truncated to an int then used as a Python
>> object).
>
> So your proposal is to make
>
>       cdef char* s
>
>       s = "äöäüöfs#dfsjdföasjf"
>
> a C byte string encoded in source encoding, and
>
>       s = "äöäüöfs#dfsjdföasjf"
>
> a byte string in source-encoding when run in Python 2 and a decoded
> unicode string when run in Python 3?

Yep.
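
As a rough plain-Python illustration of the two readings of that literal (not Cython code, and assuming the source file is encoded as UTF-8; the names text and raw are made up for the example): the decoded form is a sequence of characters, while the source-encoded form is a longer sequence of bytes, and only the latter is what a char* could point at.

```python
# The same literal seen as decoded text and as source-encoded bytes.
# Assumption: the source encoding is UTF-8.
text = u"äöäüöfs#dfsjdföasjf"
raw = text.encode("utf-8")

# Each non-ASCII character takes more than one byte in UTF-8,
# so the byte string is longer than the text string.
print(len(text), len(raw))

# Decoding with the same encoding recovers the original text.
print(raw.decode("utf-8") == text)
```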

> Note that this means that
>
>       s = "äöäüöfs#dfsjdföasjf"
>
>       cdef char* cs = s
>
> will work in Py2 and fail in Py3, whereas it currently works
> identically in both.

Several things will change, e.g. range will return a lazy range object, not a
list. (We could have a mode where Cython emulates the Py2 builtins  
even when compiled against Py3, but probably not by default). These  
are all things that are easy consequences of how py3 differs from  
py2. I think "strings are different in Py3" is much easier to  
explain, and reference, than "cython string literals are no longer  
strings" (where by strings here I mean str, the type any programmer  
gets when they type a string literal into the prompt).

As for the double assignment being different, again, using integers  
as an example

cdef int a = 1000000000000000000000000000

behaves differently than

a = 1000000000000000000000000000
cdef int ca = a
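
A plain-Python analogy of why the two paths can differ. The helper to_c_int32 is hypothetical and simply wraps a value to 32 bits the way two's-complement truncation would; it is not what Cython generates, and a real conversion from a Python object to a C int may raise OverflowError rather than wrap. The only point is that a large value squeezed into a C-sized int is not the same thing as the Python object.

```python
def to_c_int32(value):
    """Wrap a Python integer to a signed 32-bit two's-complement value."""
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low

big = 1000000000000000000000000000

# As a Python object the value is kept exactly.
as_object = big

# Squeezed into 32 bits, only the low bits survive.
truncated = to_c_int32(big)

print(as_object == truncated)
```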

> This means that you'd have to prefix basically all Python string
> literals with either 'b' or 'u' if you want a fixed type/semantics,
> whereas now you only have to prefix Python unicode strings with a 'u',
> following Python 2 syntax.

If one always wants bytes, one can do b"something." If one always  
wants unicode, one can do u"something." There's currently no  
(obvious, clean) way to get str.
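
For reference, this is how the three spellings behave in plain Python (a sketch, assuming an interpreter that accepts both prefixes, i.e. Python 2.6+ or 3.3+; the variable names are illustrative). The unprefixed literal is the only one whose type follows the interpreter.

```python
always_bytes = b"something"   # bytes on both Py2 and Py3
always_text = u"something"    # unicode on Py2, str on Py3
native = "something"          # str: bytes-like on Py2, text on Py3

# Exactly one of these is True, depending on the interpreter.
native_is_bytes = isinstance(native, bytes)
native_is_text = isinstance(native, type(u""))

print(native_is_bytes, native_is_text)
```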

> Given that this is more code overhead,

I don't think this is more code overhead--most people don't prefix
their string literals with anything at all, they just think of them
as "strings." Now you can say "it forces users who want py2 and py3
compatibility to explicitly use unicode everywhere, which they should
have been doing anyways, or their code is broken" but I think this
artificially raises the barrier for using Cython by imposing an
independent presumption (despite any validity).

Put another way, it's extra overhead (and probably incorrect) to deal
with the bytes object when using a cython module, and extra overhead  
to prefix all literals in the cython module with 'u.' It makes mixing  
strings from the environment with those from the module more  
cumbersome. Unless you're dealing with char* <-> object conversions  
you shouldn't have to think or care about encodings IMHO.

> do you have a real use case for
> literals that behave that way? The only place I've seen this so far
> are keyword argument dicts that you fill with literal string names. A
> rather rare thing, IMHO, and easy to fix using e.g. the dict() factory.

Clearly Dominic has a use case.

I have a simple use case too. Often in Sage one has functions like

def charpoly(self, algorithm='default'):
    if algorithm == 'a':
        ...
    elif algorithm == 'b':
        ...
    else:
        raise ValueError("Unknown algorithm: %s" % algorithm)

This will break if I run it in Python 3. You could say that we should  
be prefixing these with 'u', but frankly, I don't see the benefit. (I  
do like unicode in general, it's just not worth the overhead here.)  
Specifically, it would mean we have to get everyone who writes code of
the above form to use 'u' despite the fact that the existence of  
unicode is *completely* irrelevant to the task at hand. These are  
just strings, I don't want to have to think about (or, more to the  
point, explain) byte strings, encodings, unicode, etc. unless one is  
actually dealing with byte strings, encodings, etc.
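
A plain-Python sketch of the failure mode, assuming the module's unprefixed literals end up as byte strings while the caller passes an ordinary str. The function charpoly_bytes_literals is a hypothetical stand-in that spells the literals with b'' to mimic that; it is not generated Cython code.

```python
import sys

def charpoly_bytes_literals(algorithm):
    # Stand-in for a compiled module whose literals became bytes.
    if algorithm == b'a':
        return "algorithm a"
    elif algorithm == b'b':
        return "algorithm b"
    else:
        raise ValueError("Unknown algorithm: %s" % algorithm)

# A caller writes the natural thing: a plain string literal.
try:
    result = charpoly_bytes_literals('a')
    raised = False
except ValueError:
    # On Py3, 'a' is text and never equals b'a', so we end up here.
    raised = True

print(sys.version_info[0], raised)
```

On Python 2 the plain literal is itself a byte string, so the comparison succeeds; on Python 3 it silently compares unequal and the function falls through to the error branch.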

Perhaps the difference in opinion comes from my perspective that, at  
a high level, str just got changed (for the better) in Py3.

- Robert

