Re: [Cython] coercion of char/Py_UNICODE to Python objects - string or integer?

Stefan Behnel Sat, 24 Apr 2010 22:29:05 -0700

Robert Bradshaw, 25.04.2010 07:05:
> On Apr 21, 2010, at 10:37 PM, Stefan Behnel wrote:
>
>> Lisandro Dalcin, 21.04.2010 23:26:
>>> What do you think?
>>>
>>> diff -r 2701901737d4 Cython/Compiler/PyrexTypes.py
>>> --- a/Cython/Compiler/PyrexTypes.py Wed Apr 21 15:36:27 2010 +0200
>>> +++ b/Cython/Compiler/PyrexTypes.py Wed Apr 21 18:25:42 2010 -0300
>>> @@ -871,7 +871,7 @@
>>>       # to integers here.  The maximum value for a Py_UNICODE is
>>>       # 1114111, so PyInt_FromLong() will do just fine here.
>>>
>>> -    to_py_function = "PyInt_FromLong"
>>> +    to_py_function = "PyUnicode_FromOrdinal"
>>>
>>>       def sign_and_name(self):
>>>           return "Py_UNICODE"
>>
>> I didn't know about that function, even though I had looked for it in the
>> CPython docs. It's available in all relevant CPython versions, and it's
>> pretty efficient, too.
>>
>> This would let Py_UNICODE values turn into a single character unicode
>> string when coercing to a Python object. I had also thought about this, and
>> wasn't sure what I wanted. In current Cython, 'char' doesn't coerce to a
>> single character 'bytes' object but to an integer. My thinking was that
>> Py_UNICODE should behave the same.
>>
>> This is a bit inconsistent in itself, given that single character strings
>> can coerce to their C ordinal value, e.g. on comparison with
>> char/Py_UNICODE, but not so much of an inconsistency to break backwards
>> compatibility. I'm really not sure what the 'expected' behaviour is here,
>> although I'm leaning slightly towards the char/bytes and Py_UNICODE/unicode
>> coercion.
>>
>> It's certainly easier to write
>>
>>      cdef Py_UNICODE cval = some_c_integer
>>
>>      py_object = <long>cval
>>
>> to get a Python integer value, than to find, import and call
>> PyUnicode_FromOrdinal() to get a unicode string. There doesn't seem to be
>> an equivalent PyBytes function, so I guess the PyBytes conversion would use
>>
>>      py_bytes = PyBytes_FromStringAndSize(&char_val, 1)
>>
>> which isn't exactly beautiful either, and certainly less so than the opposite
>>
>>      py_integer = <int>char_val
>>
>> This would also speak in favour of letting char and Py_UNICODE coerce to
>> Python strings by default, although the above would go away if we special
>> cased the builtin chr() function to output exactly the above code for each
>> input type.
>>
>> Another option is to consider Py_UNICODE more special (and more specific)
>> than the somewhat generic 'char', and to accept the inconsistency of
>> coercing one to a unicode string and the other to an integer.
>>
>> What do the others think?
>
> I think char -> bytes and Py_UNICODE -> unicode make a lot of sense,
> my only concern would be backwards incompatibility.


It occurred to me that the alternative is actually simpler. We can support 
the explicit coercions

     cdef Py_UNICODE uval = ...
     cdef unicode u = uval
     s = <unicode>uval

so the question is just whether we want

     py_int_val = uval
     py_ustr_val = <unicode>uval

or

     py_int_val = <int>uval
     py_ustr_val = uval
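
To make the two options concrete, here is a rough pure-Python model of them. It is only a sketch and not Cython code: a Py_UNICODE value is modelled as a plain int holding a code point, and the helper names coerce_as_int and coerce_as_unicode are made up for illustration. chr() stands in for what PyUnicode_FromOrdinal() does at the C level.

```python
# Pure-Python model of the two candidate coercion rules (not Cython itself).
# Assumption: a Py_UNICODE value is represented as an int code point.


def coerce_as_int(uval: int) -> int:
    """First option: implicit coercion yields a Python integer."""
    return uval


def coerce_as_unicode(uval: int) -> str:
    """Second option: implicit coercion yields a one-character string,
    the Python-level equivalent of PyUnicode_FromOrdinal(uval)."""
    return chr(uval)


uval = 0x20AC  # code point of the EURO SIGN

py_int_val = coerce_as_int(uval)       # what "py_int_val = uval" gives under option 1
py_ustr_val = coerce_as_unicode(uval)  # what "py_ustr_val = uval" gives under option 2
```

Either way the other representation stays one explicit cast away, which corresponds to ord() and chr() in this model.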

My gut feeling is that the coercion to strings would be more
straightforward. It would also clean up the compiler code a bit, as implicit
coercions (e.g. for comparisons) would then work out of the box in both
directions. Currently, "Py_UNICODE in unicode" must be special cased. It
would still be special cased in the future, but only for optimisation
purposes, not to make it work at all.
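
As a rough illustration of that last point, the following pure-Python sketch models the two ways a "Py_UNICODE in unicode" test could be evaluated. The function names are hypothetical and this is not what the compiler emits; it only shows that the generic path (coerce to a string, then use the normal containment test) and a special-cased path (compare ordinals directly, without a temporary string) agree on the result.

```python
# Model of "uval in some_unicode_string" for a Py_UNICODE-like value.
# Assumption: uval is an int code point; both helpers are illustrative only.


def contains_generic(uval: int, text: str) -> bool:
    """Generic path: coerce the character to a one-character string,
    then rely on the ordinary string containment test."""
    return chr(uval) in text


def contains_special_cased(uval: int, text: str) -> bool:
    """Special-cased path: compare code points directly, so no
    temporary string object is needed for the character."""
    return any(ord(ch) == uval for ch in text)


found_generic = contains_generic(ord("b"), "abc")
found_special = contains_special_cased(ord("b"), "abc")
```

In pure Python the second variant is not actually faster; it merely mirrors the logic that a C-level optimisation would follow.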

Stefan