Re: [Cython] Another string encoding idea

Robert Bradshaw Mon, 30 Nov 2009 19:24:50 -0800

On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:

>> Robert Bradshaw wrote:
>>> this is the kind of thing that usually tells me there's a deficiency
>>> in the language that should be fixed to ease the user's burden instead.
>
> sure -- but the deficiency is in C (and py2), and that's not something
> we can fix. As for the Cython language, it should really follow Python:
> unicode for "text", bytes for arbitrary data.
>
> But we need to deal with C (and fortran) no matter how you slice it.
>
> I wrote a similar post on the numpy list: I think the key from a user's
> perspective is that one is either working with "text" (human-readable
> stuff) or data. If text, then the natural python3 data type is a unicode
> string. If data, then bytes -- we should really follow that as best we can.


Exactly.

unicode = char* + length + encoding
bytes = char* + length

So what is the Python equivalent of char*? Neither, and what you want  
depends on the application and context.
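
A minimal Python 3 sketch of that distinction (illustrative only, not from the original thread; the variable names are made up): the same byte payload only becomes text once an encoding is attached to it, and a different encoding gives different text.

```python
# bytes is just "char* + length": no meaning is attached to the payload.
raw = b"caf\xc3\xa9"

# Attaching an encoding turns the payload into text.
as_utf8 = raw.decode("utf-8")       # 4 characters
as_latin1 = raw.decode("latin-1")   # 5 characters, a different string

# Going back to bytes needs the same explicit choice of encoding.
round_trip = as_utf8.encode("utf-8")
```

A bare char* carries neither the length nor the encoding, which is why neither Python type maps onto it automatically.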

>> most of the
>> libraries we work with would probably balk at anything but ASCII
>> anyways
>
> This is key. unicode is new, and AFAICT, C still doesn't really have a
> decent way to deal with it anyway (it never even had a native string type).
>
> So a very, very common usage is for C and Fortran code and libraries to
> expect char*, encoded in ASCII (or ANSI, but 1 byte per character, in any
> case). It needs to be easy, and perhaps automatic, to write code that
> crosses the Python-C border in these cases.
>
> I've lost track of what has been proposed here, but it seems to me that
> we need a Cython type:
>
> ANSI_string  (not that that's what it should be called)
>
> It might be nice if there were a way to specify the encoding -- ASCII,
> Latin1, etc. though it would have to be a 1-byte-per-character encoding.
> I'm not sure what the syntax could be for that, but I'd like to have it
> specified in the code near where it is used, rather than as a
> program-wide default.

Compiler directives can be specified on a per-file, per-function, or  
per-block basis. On a per-line basis, I think it's easier to just call  
s.decode("ASCII").

> If you declare a variable an ANSI_string, then Cython will convert to a
> char* internally, using ASCII (or another defined encoding). At the
> python level it could accept either a unicode string or a byte string,
> passing the byte string right on through. A runtime error would be
> raised if the input could not be ASCII encoded.
>
> It seems this would handle the very common case of libraries expecting
> simple ascii strings for flags, etc.

That is another idea. A new type would handle conversion to char*, but
not from char*. Bytes objects would still be returned by default
unless one did something extra there (which is fine for some uses, but
for others str is more natural).
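
A rough pure-Python sketch of the behaviour being discussed, under stated assumptions: `to_char_data` is a hypothetical helper, not an existing Cython type or API, and it only models the Python-side conversion (bytes pass through, text is encoded). The return direction stays bytes unless the caller decodes explicitly.

```python
def to_char_data(value, encoding="ascii"):
    """Hypothetical helper: return bytes suitable for handing to C as char*."""
    if isinstance(value, bytes):
        # Byte strings pass straight through, with no copy and no check.
        return value
    if isinstance(value, str):
        # Text is encoded; UnicodeEncodeError is raised if it does not fit.
        return value.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# To char*: either input type is accepted.
flag = to_char_data("rw")
same = to_char_data(b"rw")

# From char*: the result is bytes unless the caller does "something extra".
text = flag.decode("ascii")
```

Taking the encoding as a parameter keeps the choice next to the call site rather than in a program-wide default.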

> It would be kind of like numpy's "asarray" call, in that it may or may
> not make a copy, depending on what the input is, but I don't think that
> would be a problem, as strings are immutable anyway.
>
> Wouldn't this be much like declaring a variable a C int, and being able
> to pass in python integers that may or may not (until run time) fit?

Yep, I'm thinking if the encoding fails, a runtime error would result.
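
To make the analogy concrete, here is a small sketch using only the Python standard library (the helper names are invented for illustration, and `"=i"` is used to model a 32-bit C int). In both cases the value is only checked at run time, and a value that does not fit raises an error instead of being silently truncated.

```python
import struct


def fits_c_int(n):
    """Return True if the Python int n fits in a 32-bit signed C int."""
    try:
        struct.pack("=i", n)   # "=i": standard-size (4-byte) signed int
        return True
    except struct.error:
        return False


def fits_ascii(s):
    """Return True if the text s can be encoded as ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```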

> This is completely from a user's perspective.

Thank you! The more of the user's perspective we can get, the better.

- Robert


