Hi,

Dominic Sacré wrote:
> I'm trying to make a Pyrex/Cython module that was originally written for 
> Python 2.x work with Python 3.x, while at the same time keeping it 
> compatible with older versions.
> 
> It seems like when using Python 3.x, Cython will automatically replace 
> 'unicode' with 'str', and 'str' with 'bytes'. Also, string literals are 
> interpreted as 'bytes' unless prefixed with 'u'.

Correct.


> However, 'bytes' is not really useful in a context where an actual 
> string is expected

You mean "text", I suppose? "string" is ambiguous as it can refer to C
strings, Python byte strings and Python Unicode strings.


> and causes problems for example when working with
> strings passed from Python.
> (One of many issues I have run into is the fact that b"foo" != "foo"...)

Yep, and that's a really good thing.

I fixed loads of those in Cython lately, and tons of them in the test suite.
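
For illustration, here is a minimal sketch in plain Python (not Cython) of why that comparison fails and how an explicit decode resolves it. The variable names are made up for this example.

```python
# A byte string and a text string with the "same" ASCII content.
data = b"foo"
text = u"foo"

# Python 3 never compares bytes and text as equal, so this is False there.
# Python 2 implicitly decodes the bytes as ASCII and yields True instead.
same_without_decoding = (data == text)

# Decoding explicitly gives the same answer on both versions.
same_after_decoding = (data.decode("ascii") == text)
```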


> The only solution I've found to at least get most of my code working is 
> basically to use unicode for almost everything

That's the way to go anyway. To make the code Unicode aware, you have to
make it distinguish between text, encoded text and data.


> but if possible I'd like to avoid unicode strings in the 2.x version.

That's not impossible, but it certainly takes some work, and the benefit is
rather questionable: it can easily bite you if you do not take care to keep
the three-fold separation above (text, encoded text, data).

I do this in lxml as the API dictates that under Py2, ASCII-compatible byte
strings are accepted, and results are returned as ASCII-encoded byte strings.
I actually work completely with UTF-8 encoded strings inside of lxml and use
dedicated functions for checking and encoding everything that comes through
the API or that goes back to the user.

The main theme is to decide if you want to work with unicode internally or
with encoded byte strings. Choose one or the other, not both. And make sure
you check byte strings that contain text on the way in and reject them in
the face of encoding ambiguity.
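
The boundary checks described above could be sketched roughly like this, assuming you pick UTF-8 encoded byte strings as the internal representation. This is plain Python rather than Cython, and the function names to_internal_utf8() and from_internal_utf8() are hypothetical; this is not the actual lxml code.

```python
text_type = type(u"")  # 'unicode' on Python 2, 'str' on Python 3


def to_internal_utf8(value):
    """Hypothetical inbound check: return UTF-8 bytes for text input."""
    if isinstance(value, bytes):
        # A byte string that is not plain ASCII has an unknown encoding,
        # so reject it instead of guessing.
        try:
            value.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError(
                "byte strings must be plain ASCII, pass Unicode text instead")
        return value  # ASCII bytes are already valid UTF-8
    if isinstance(value, text_type):
        return value.encode("utf-8")
    raise TypeError("expected text or an ASCII byte string")


def from_internal_utf8(data):
    """Hypothetical outbound conversion: decode internal bytes to text."""
    return data.decode("utf-8")
```

Everything between those two functions then only ever sees one kind of string, which is the point of choosing one representation and sticking to it.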

In any case, byte strings that hold data should remain unchanged. File names
are a notable trouble spot, though: they are really text, but that won't
necessarily help you when you try to find them in a file system that stores
them encoded, or when a user passes you an encoded URL that came from
whatever source.


> Is there a sane way to use the native string type (i.e. 'str') in either 
> Python version?

... and have Cython automatically encode and decode the byte strings for
you? No, certainly not. Encoding is an explicit operation and it will make
your code safer to make it explicit.
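
If you really want the native 'str' type on both versions, one way to keep the conversion explicit is a small helper at the API boundary. The following is only a sketch in plain Python; to_native_str() and its default encoding are assumptions made for this example, not something Cython provides.

```python
import sys

PY3 = sys.version_info[0] >= 3


def to_native_str(value, encoding="utf-8"):
    """Hypothetical helper: return value as the native 'str' type."""
    if isinstance(value, str):
        return value  # already the native string type
    if PY3:
        # On Python 3 the non-native type is bytes: decode explicitly.
        return value.decode(encoding)
    # On Python 2 the non-native type is unicode: encode explicitly.
    return value.encode(encoding)
```

The encoding is still chosen by you, at a place you can see in the code, rather than being applied silently.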

Stefan

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
