Re: [Cython] Another string encoding idea

Robert Bradshaw Sat, 28 Nov 2009 13:14:49 -0800

On Nov 28, 2009, at 6:13 AM, Dag Sverre Seljebotn wrote:

> Robert Bradshaw wrote:
>> On Nov 27, 2009, at 10:52 PM, Stefan Behnel wrote:
>>> Currently, coercion from char*/bytes to unicode is an explicit step
>>> that is easy to do via
>>>
>>>   cdef char* s = ...
>>>   u = s[:length].decode('UTF-8')
>>>
>>> in 0.12. See
>>>
>>> http://trac.cython.org/cython_trac/ticket/436
>>
>> That is an improvement, though still a lot more baggage than
>>
>> cdef char* s = ...
>> u = s
>
> Hmm. Seeing it in action makes me worry even more. I'm leaning towards
> -1 for the whole proposal now.
>
> In Python "u = s" always means a strict transfer of reference. In Cython
> we diverge from this (apart from raising exceptions on mismatch) in some
> places:
>  a) When converting intrinsic types. However, these are always immutable
> in Python and so the semantic mismatch isn't there.
>  b) Structs
>  c) When converting char*<->bytes? (If a copy is made, otherwise it can
> be considered similar to Python. I'm not sure what the case is.)
>
> Making "u = s" mean more than a pure transfer of reference/copy of
> immutable object makes problems for both pure Python mode and
> possibility of type inference. I believe it contradicts the direction
> we've gone in -- that static types should be as optional as we can
> possibly make them. Instead, it is proposed that "u = s" is overloaded
> to mean encoding conversion, which is something quite different from an
> assignment.


I think of the C <-> Python conversions (whether by assignment or
casting) as "create the best Python (or C) equivalent." The directive
would flag that the Python equivalent of char* is str in Py3, not bytes.

I'm trying to make it easy to not violate the principle of least
surprise, as I find bytes objects surprising to deal with.
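
To make the bytes-vs-str difference concrete, here is a minimal pure-Python
sketch. The bytes literal is only a stand-in for data read from a char*
buffer, and UTF-8 is an assumed encoding, not something the proposal fixes.

```python
# Stand-in for data read from a C char* buffer (assumed to be UTF-8 here).
raw = b"caf\xc3\xa9"

# The explicit step, mirroring the s[:length].decode('UTF-8') idiom.
text = raw.decode("UTF-8")

# Differences that show up in Python 3 when the bytes are left undecoded:
first = raw[0]          # indexing bytes yields an int, not a 1-char string
n_bytes = len(raw)      # counts encoded bytes
n_chars = len(text)     # counts decoded characters
```

The length mismatch and the integer from indexing are the kind of behaviour
that differs from what str-based code expects.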

> I believe this contradicts the Pythonic philosophy of being explicit
> (where even "self" is passed explicitly...).
>
> One solution around this would be to create a new, Cython-specific
> string class which constitutes a view on a char*, rather than a copy.
> Views are fine (as they are semantically similar to a pure reference
> assignment "u = s").

That may be as much overhead as (and less intuitive than) explicitly
decoding and encoding.
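
For reference, the view-versus-copy distinction can be sketched in plain
Python. This is only an analogy: bytearray stands in for a mutable C buffer
and memoryview for the proposed view class, which does not exist in Cython.

```python
buf = bytearray(b"abc")      # stand-in for a mutable C char buffer
view = memoryview(buf)       # a view: shares memory with buf
copy = buf.decode("ASCII")   # a decoded str: an independent copy

buf[0] = ord("x")            # mutate the underlying buffer

seen_by_view = bytes(view)   # reflects the mutation
seen_by_copy = copy          # unaffected by the mutation
```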

> Is proper string handling creating big problems in Sage, since the
> question keeps coming up?

With Sage we ignore the issue completely (with the exception of the  
notebook, which is all in Python anyway), and it works fine. In fact,  
I can't remember any complaints about it (again, except for the  
notebook) and we have more non-US users than US users (extrapolating  
from the latest web stats). I don't usually bring the topic up, it  
just comes up in response to user inquiries.

My personal concern is the pain I see in porting Sage to Py3. I'd have to
go through the codebase and throw in encode() and decode() calls and
change signatures of functions that take char* arguments (which, I
just realized, will be a step backwards for cpdef functions). The
thought of mechanically going through and doing all of this,
especially when I would be surprised to see any benefit (most of the
libraries we work with would probably balk at anything but ASCII
anyway), makes me wonder if there's a better way. This is the kind
of thing that usually tells me there's a deficiency in the language
that should be fixed to ease the user's burden instead. I would also
have a hard time explaining to people (including myself) why this step
of encoding/decoding can't just be automated everywhere (unless
there's truly a technical obstruction). That's where I'm coming from.

> I just don't mentally associate char* with
> strings at all and thus didn't ever think about this as a problem...

What do you associate with strings in C?

> I don't know why one wouldn't want to call encode/decode explicitly if
> only to make better self-documenting code about what is going on.

If explicitly encoding and decoding is irrelevant to the purpose of  
the function call, I think it makes the code less clean and readable.

- Robert

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
