On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:
>> Robert Bradshaw wrote:
>> this is the kind
>>> of thing that usually tells me there's a deficiency in the language
>>> that should be fixed to ease the user's burden instead.
>
> sure -- but the deficiency is in C (and py2), and that's not something
> we can fix. As for the Cython language, it should really follow Python:
> unicode for "text", bytes for arbitrary data.
>
> But we need to deal with C (and fortran) no matter how you slice it.
>
> I wrote a similar post on the numpy list: I think the key from a user's
> perspective is that one is either working with "text" (human-readable
> stuff) or data. If text, then the natural python3 data type is a unicode
> string. If data, then bytes -- we should really follow that as best we
> can.
Exactly.
unicode = char* + length + encoding
bytes = char* + length
So what is the Python equivalent of char*? Neither, and what you want
depends on the application and context.
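To make the distinction concrete, here is a rough Python 3 sketch (plain Python, not Cython; the sample word and variable names are only for illustration). The same text yields different byte sequences depending on the encoding, and the bytes alone do not record which encoding produced them:

```python
# Python 3 sketch: one piece of text, two different byte layouts.
text = "na\u00efve"                  # unicode: 5 characters, no byte layout implied
utf8 = text.encode("utf-8")          # the accented character takes two bytes here
latin1 = text.encode("latin-1")      # ...and a single byte here

n_chars = len(text)
n_utf8 = len(utf8)
n_latin1 = len(latin1)

# Decoding only round-trips when the encoding is known; the bytes alone
# (char* + length) do not say which one was used.
roundtrip_ok = (utf8.decode("utf-8") == text
                and latin1.decode("latin-1") == text)

# Decoding with the wrong encoding raises no error here, it just gives
# back text that differs from the original.
mismatch = utf8.decode("latin-1")
```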
>> most of the libraries we work with would probably balk at anything
>> but ASCII anyways
>
> This is key. unicode is new, and AFAICT, C still doesn't really have a
> decent way to deal with it anyway (it never even had a native string
> type).
>
> So a very, very common usage is for C and Fortran code and libraries to
> expect char*, encoded in ASCII (or ANSI, but 1 byte per character, in
> any case). It needs to be easy, and perhaps automatic, to write code
> that crosses the Python-C border in these cases.
>
> I've lost track of what has been proposed here, but it seems to me that
> we need a Cython type:
>
> ANSI_string (not that that's what it should be called)
>
> It might be nice if there were a way to specify the encoding -- ASCII,
> Latin1, etc. though it would have to be a 1byte-per-character encoding.
> I'm not sure what the syntax could be for that, but I'd like to have it
> specified in the code near where it is used, rather than as a
> program-wide default.
Compiler directives can be specified on a per-file, per-function, or
per-block basis. On a per-line basis, I think it's easier to just call
s.decode("ASCII").
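As a sketch of the per-line approach (plain Python 3; `raw` stands in for a bytes object obtained from a char*, and `decode_or_none` is a hypothetical helper, not part of any library):

```python
raw = b"OK"  # stand-in for bytes that came back from a C char*

# The encoding is stated right where the conversion happens.
text = raw.decode("ASCII")

def decode_or_none(data):
    """Decode ASCII bytes to str, or return None if they are not ASCII."""
    try:
        return data.decode("ASCII")
    except UnicodeDecodeError:
        return None

# 0xE9 is outside the ASCII range, so this decode fails at run time.
bad = decode_or_none(b"caf\xe9")
```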
> If you declare a variable an ANSI_string, then Cython will convert to a
> char* internally, using ASCII (or another defined encoding). At the
> python level it could accept either a unicode string or a byte string,
> passing the byte string right on through. A runtime error would be
> raised if the input could not be ASCII encoded.
>
> It seems this would handle the very common case of libraries expecting
> simple ascii strings for flags, etc.
That is another idea. A new type would handle conversion to char*, but
not from char*. Bytes objects would still be returned by default
unless one did something extra there (which is fine for some uses, but
for others str is more natural).
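Roughly, the to-char* direction of such a type might behave like the following plain-Python sketch. This is only an assumption about the semantics being proposed, and `as_ansi_bytes` is a hypothetical name, not an existing Cython feature:

```python
def as_ansi_bytes(value, encoding="ascii"):
    """Return bytes suitable for handing to C as a char*.

    Hypothetical helper mirroring the proposal: bytes pass straight
    through (no copy), str is encoded, anything else is rejected.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        # Raises UnicodeEncodeError if the text does not fit the encoding.
        return value.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)

flag_from_text = as_ansi_bytes("rb")
original = b"rb"
flag_from_bytes = as_ansi_bytes(original)   # same object, passed through

try:
    as_ansi_bytes("caf\u00e9")               # not representable in ASCII
    failed = False
except UnicodeEncodeError:
    failed = True
```

Note that nothing here covers the other direction: a char* coming back from C would still arrive as bytes and need an explicit decode.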
> It would be kind of like numpy's "asarray" call, in that it may or may
> not make a copy, depending on what the input is, but I don't think that
> would be a problem, as strings are immutable anyway.
>
> Wouldn't this be much like declaring a variable a C int, and being able
> to pass in python integers that may or may not (until run time) fit?
Yep, I'm thinking if the encoding fails, a runtime error would result.
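The analogy can be sketched in plain Python, using the standard `struct` module as a stand-in for the C int conversion (both helper names are hypothetical). In each case the value is only checked at run time:

```python
import struct

def fits_c_int(n):
    """True if n fits a 4-byte signed int ("=i" is struct's standard-size int)."""
    try:
        struct.pack("=i", n)
    except struct.error:
        return False
    return True

def fits_ascii(s):
    """True if every character of s can be encoded as ASCII."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

small_ok = fits_c_int(42)
big_ok = fits_c_int(2 ** 40)        # too large for 32 bits
plain_ok = fits_ascii("rb")
accent_ok = fits_ascii("\u00e9")    # outside the ASCII range
```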
> This is completely from a user's perspective.
Thank you! The more of the user's perspective we can get, the better.
- Robert
_______________________________________________
Cython-dev mailing list
http://codespeak.net/mailman/listinfo/cython-dev