Re: [Cython] Idea for automatic encoding and decoding

Robert Bradshaw Thu, 17 Dec 2009 16:25:43 -0800

On Dec 15, 2009, at 12:06 AM, Stefan Behnel wrote:

> Robert Bradshaw, 15.12.2009 00:40:
>> I don't think users doing input validation are going to stop doing
>> input validation because of an easier str -> char* conversion option.
>> I'm also skeptical that having to manually do str -> bytes -> char*
>> encourages input validation. Validation is good. Shunning user
>> friendliness to try to enforce validation is not (in my mind) so good.
>
> The only case I really care about here is 0 bytes. Besides that case,
> 'bytes' and 'char*' are basically equivalent (or should be, at least),
> except for memory management, which is the main advantage of the
> bytes type.

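To make the 0-byte concern concrete, here is a minimal sketch in plain
Python (using ctypes as a stand-in for the Cython bytes -> char* coercion,
which is an assumption made purely for illustration). A bytes object
carries its length, while a NUL-terminated char* view silently ends at the
first 0 byte:

```python
import ctypes

# A bytes object knows its own length, embedded 0 byte included.
data = b"ab\x00cd"

# Viewed as a NUL-terminated C string, everything from the first
# 0 byte onwards is invisible to the C side.
as_c_string = ctypes.c_char_p(data).value

print(len(data))     # length as Python sees it
print(as_c_string)   # content as C string functions would see it
```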

The other difference is that introducing object -> object conversions,
while solving the memory issue, makes the language semantics much  
messier. For example "<bytes>o is o" would no longer always hold, and  
"<bytes?>o" would no longer be shorthand for raising an error if not  
isinstance(o, bytes), and that's just for explicit coercions.
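
A rough model of that point in plain Python rather than Cython. Both
function names are hypothetical: checked_cast stands in for what <bytes?>o
means today, and converting_cast for a cast that would encode unicode
input on the fly (utf-8 is just an assumed default):

```python
def checked_cast(o):
    """Model of today's <bytes?>o: a type check that never creates an object."""
    if not isinstance(o, bytes):
        raise TypeError("expected bytes, got %s" % type(o).__name__)
    return o


def converting_cast(o, encoding="utf-8"):
    """Hypothetical converting cast: unicode input yields a *new* bytes object."""
    if isinstance(o, str):
        return o.encode(encoding)
    return checked_cast(o)


b = b"abc"
u = "abc"

# With a pure check, the result is always the very same object.
assert checked_cast(b) is b

# With conversion, identity holds for bytes input only.
assert converting_cast(b) is b
assert converting_cast(u) is not u
assert converting_cast(u) == b"abc"
```

With the converting variant, a cast can no longer be read as "same
object, or an error", which is the semantic messiness I mean.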

Any magic that happens is much less surprising on the Python/C  
boundary, as it's already obvious something non-trivial is going on  
there. (That being said, something like <bytes[encoding=utf8]>o is  
overt enough to diminish the level of surprise.)

>
>>> I'm not sure simply saying
>>>
>>>   def func(bytes s):
>>>       ...
>>>
>>> plus a global setting somewhere at the top of your code is really
>>> readable enough as "this function accepts unicode strings which get
>>> converted automatically". And, no, I don't think typing the input
>>> parameter as "str" is what people want in most cases. I'm really
>>> leaning towards the assumption that most people really *want* bytes
>>> as basic string input type in their Cython code. Either that, or
>>> exactly unicode strings. Not 'str'.
>>
>> I agree with you for Py3, but Py2 is an important target, arguably
>> more important than Py3 at this point in time (until numpy and the
>> rest of the scientific world move over), and will be with us for at
>> least a while longer.
>
> In Py2, 'str' is 'bytes', and my statement certainly holds for Py2.
> Honestly, what would you want with an input data type that suddenly
> switches to something completely different when you compile your code in
> Py3? If you want encoded bytes input in Py2, you most likely want encoded
> bytes input in Py3 as well (see the Wiki page I started). And if you want
> unicode in Py2, you surely want unicode in Py3.

I wasn't trying to say people should type their arguments as str; I was
claiming that it's common to want to accept both bytes and unicode in
Py2. This is what you said in the wiki: "For Python 2.x, the code needs
to deal with both str (bytes) and unicode, whereas it would only
accept unicode strings (str) in Python 3." So I think we're in
agreement here.

>
>> I think they're relatively orthogonal. Most of the discussion has been
>> about adding new types, new syntax, mutating objects from one type to
>> another, etc. and the semantics of doing all that are much less clear
>> than "if an encoding is needed, use this one rather than bailing..."
>
> If that's so clear, then please answer the following: when is an encoding
> needed? Is that only when coercing between char* and Python strings, or
> also when coercing between bytes/unicode?

Only object <-> char* would use a default encoding, everything else  
would be explicit.

> Will there be a different
> handling for function signatures, or will it work the same everywhere?

That depends on how we are able to handle the memory. Ideally the same  
everywhere, but that may not be feasible.

> I.e. will a "def func(bytes b)" function always accept unicode, and what
> is the way to disable that?

I was thinking not.

> Or will only "def func(char*)" accept unicode input?

Yep.

> And will the latter still accept bytes input?

Yes. One could make the case that in the
treat-all-char*-as-encoded-text mode, bytes should be disallowed in Py3.
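
Pulling those answers together, here is a sketch of the rules as I
currently picture them, modelled in plain Python. None of this is an
existing Cython API: DEFAULT_ENCODING, bytes_argument and
char_ptr_argument are hypothetical names, and the memory management of
the temporary encoded object is deliberately left out.

```python
# Assumed module-level setting; the name and the value are hypothetical.
DEFAULT_ENCODING = "utf-8"


def bytes_argument(o):
    """Model of 'def func(bytes b)': explicit only, unicode is rejected."""
    if not isinstance(o, bytes):
        raise TypeError("expected bytes, got %s" % type(o).__name__)
    return o


def char_ptr_argument(o, encoding=DEFAULT_ENCODING):
    """Model of 'def func(char* s)': unicode is encoded with the default
    encoding, bytes input is passed through unchanged."""
    if isinstance(o, str):
        return o.encode(encoding)
    if isinstance(o, bytes):
        return o
    raise TypeError("expected bytes or unicode, got %s" % type(o).__name__)


# Only the object -> char* boundary applies the default encoding.
assert char_ptr_argument("caf\u00e9") == b"caf\xc3\xa9"
assert char_ptr_argument(b"raw") == b"raw"

# A bytes-typed argument stays strict.
try:
    bytes_argument("text")
except TypeError:
    pass
else:
    raise AssertionError("unicode should be rejected here")
```

The stricter Py3 variant mentioned above would simply drop the bytes
branch from char_ptr_argument.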

> Not so clear to me, at least, and certainly not obvious.

Too many other ideas floating around. I should write up a CEP.

- Robert

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
