On Dec 13, 2009, at 3:11 AM, Stefan Behnel wrote:

> Robert Bradshaw, 13.12.2009 10:51:
>> On Dec 12, 2009, at 11:35 PM, Stefan Behnel wrote:
>>> So I think the right solution is to support automatic conversion
>>> *only* at the Python call boundary, i.e. for Python function
>>> parameters and return values.
>>
>> I disagree. Most of the examples here have been very simple, but in
>> general Python/C boundary need not be cleanly aligned with the Python
>> call boundary. Some more general examples would be
>>
>>     cdef extern from "foo.h":
>>         cdef void cblarg(int i, char* name)
>>
>>     def blarg(obj):
>>         # I realize I'm assuming name is not a dynamically
>>         # generated attribute...
>>         cblarg(obj.id, obj.name)
>>
>> or even
>>
>>     def blarg_all(list L):
>>         for i, a in enumerate(L):
>>             cblarg(i, a)
>
> I guess I'm still not used to passing arbitrary user values into a C
> function call without doing some kind of parameter checking beforehand.
> That's different for function arguments, where only the encoding would
> happen automatically (and would raise an appropriate error on failure),
> and the result would still be a safe Python bytes object that users can
> validate in any way they want, without having to care about 0 bytes
> silently becoming end markers.
>
> We are still talking about two different use cases here. One deals with
> automatic encoding of unicode strings into byte strings on input and
> with automatic decoding of byte strings (or char*) on the way out.

Yep, though they are related. I would imagine that most (though not
all) of the time an API would return the same kind of string it
expects.
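
Stefan's point above about 0 bytes silently becoming end markers can be
seen directly in plain Python (a minimal sketch using the standard
ctypes module; nothing here is Cython-specific):

```python
import ctypes

# A Python bytes object tracks its own length, so it can safely hold
# embedded NUL bytes; a C char* treats the first 0 byte as an end marker.
data = b"ab\x00cd"
print(len(data))   # the bytes object sees all 5 bytes

# Viewed through a char* (here via ctypes.c_char_p), everything after
# the first NUL is silently dropped.
truncated = ctypes.c_char_p(data).value
print(truncated)   # b'ab'
```

This is exactly why a coercion that hands a bytes object straight to a
char* parameter can lose data without raising any error.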

> The other use case deals with automatic coercion of Python string
> objects to char*, which is what you show above. I personally think
> it's good to keep those separate.
>
> Remember that you mentioned the performance issue of a char* vs. a
> Python object parameter when the function is called from Cython code?
> The only place where this matters is for cpdef functions, and that
> should be rare enough to ignore it and require an explicit wrapper
> function,

Not if we start using a def -> cpdef optimization by default.

> as it's quite likely that user input would have to be validated  
> separately anyway.
>
> To make this clear: I don't think it's worth encouraging users to drop
> input validation in favour of automatic and unsafe coercion.

I don't think users doing input validation are going to stop doing
input validation because of an easier str -> char* conversion option.
I'm also skeptical that having to manually do str -> bytes -> char*
encourages input validation. Validation is good. Shunning user
friendliness to try to enforce validation is not (in my mind) so good.
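
To make the two concerns concrete, here is a plain-Python sketch of the
manual str -> bytes -> char* route under discussion; the function name
`blarg` and its validation rule are illustrative assumptions, not
anything from Cython itself:

```python
# Sketch: validation and encoding are orthogonal steps. In real Cython
# code the resulting bytes object would then be passed to a C function
# taking char*.

def blarg(name):
    # Validation is a separate concern: it stays here whether the
    # encode step below is written by hand or done automatically.
    if not name:
        raise ValueError("name must be non-empty")
    # The explicit encode step that an automatic str -> char*
    # coercion would absorb.
    return name.encode("utf-8")
```

Making the encode step automatic removes only the boilerplate line, not
the validation above it.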

>
>> I'm all for making string encodings easier to use, though as I've said
>> encode() and decode() seem to be a clean enough solution for nearly
>> everything but argument parsing.
>
> That seems to match my distinction above then.
>
>
>> However (and maybe this belongs on the other thread), you are
>> completely skirting the issue of being able to declare the encoding
>> for a block of code in one place, rather than having to specify it
>> every single place it is used.
>
> Yes, the above would actually be orthogonal to that feature. Although
> I'm not sure simply saying
>
>    def func(bytes s):
>        ...
>
> plus a global setting somewhere at the top of your code is really
> readable enough as "this function accepts unicode strings which get
> converted automatically". And, no, I don't think typing the input
> parameter as "str" is what people want in most cases. I'm really
> leaning towards the assumption that most people really *want* bytes
> as basic string input type in their Cython code. Either that, or
> exactly unicode strings. Not 'str'.

I agree with you for Py3, but Py2 is an important target, arguably more
important than Py3 at this point in time (until numpy and the rest of
the scientific world moves over), and will be with us for at least a
while longer.

>
>> I initially thought your concern with char* <-> unicode conversion
>> was the ambiguity in what character set to use, which I was proposing
>> could be declared at a higher than case-by-case level. Is there
>> another reason it is vital that the encoding step and/or parameters
>> be reiterated at every instance they are used?
>
> I don't like code redundancy either. But making up a default should
> only be the second step after fixing the semantics of the feature
> that has this default.


I think they're relatively orthogonal. Most of the discussion has been
about adding new types, new syntax, mutating objects from one type to
another, etc., and the semantics of doing all that are much less clear
than "if an encoding is needed, use this one rather than bailing..."
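
As a rough plain-Python sketch of the "declare the encoding in one
place" idea (the names DEFAULT_ENCODING and as_char_buffer are invented
here for illustration; this is not Cython syntax):

```python
# One module-level default encoding, consulted wherever a Python string
# must be handed to C as char*, instead of repeating the encoding name
# at every call site.

DEFAULT_ENCODING = "utf-8"  # declared once, at the top of the module

def as_char_buffer(s):
    """Return a bytes object suitable for passing on as char*."""
    if isinstance(s, bytes):
        return s  # already encoded; pass through unchanged
    # Encode using the module-wide default rather than an encoding
    # argument repeated at each use.
    return s.encode(DEFAULT_ENCODING)
```

Cython's later c_string_encoding compiler directive works along these
lines, applying one declared encoding module-wide.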

- Robert


_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
