On Nov 30, 2009, at 10:41 PM, Stefan Behnel wrote:

>
> Robert Bradshaw, 01.12.2009 04:09:
>> Just to clarify discussion, here is what I'm proposing (which is
>> still in flux, and simplified due to memory issues, which does make
>> it less attractive as one does not get to choose the used encoding,
>> but it would always be UTF-8 in Py3).
>
> ... and the 'default encoding' in Py2, which may or may not be ASCII,
> but would likely be at least something that's compatible with ASCII,
> as it would break tons of code otherwise.

Yep.

>
>
>> Without directive(s) (as it is now):
>>
>>    char* <-> bytes
>>
>> With the directive(s) (which can be applied locally or globally):
>>
>>     char* <-> str
>>     unicode/bytes -> char* would also work (for Py2/Py3 respectively)
>
> 'respectively' in the sense of 'for both'?

I was just avoiding being redundant with the case covered by
char* <-> str. Yes, 'for both' would be accurate to say as well.
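In Python terms, the coercion pair the directive would generate looks roughly like the following (the function names are purely illustrative, not actual Cython API):

```python
import sys

# Hypothetical sketch of the directive's char* <-> str coercions.
# In Py2, str *is* bytes, so the C string passes through unchanged;
# in Py3, str is unicode, so the bytes are decoded (always as UTF-8).
def c_string_to_str(c_bytes):
    if sys.version_info[0] >= 3:
        return c_bytes.decode('utf-8')
    return c_bytes  # Py2: no conversion needed

# The reverse direction, str/unicode -> char*: the encoding is the
# system default in Py2 and always UTF-8 in Py3.
def str_to_c_string(s):
    if isinstance(s, bytes):
        return s  # bytes -> char* works in both Py2 and Py3
    encoding = 'utf-8' if sys.version_info[0] >= 3 else sys.getdefaultencoding()
    return s.encode(encoding)
```

Without the directive, only the bytes branch would exist.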

>
>> The encoding used would be the system default (in Py2) and UTF-8 (in
>> Py3). This would use the defenc slot so the encoded char* would be
>> valid as long as the unicode object is around, and the long term
>> future of the defenc slot needs to be ensured before this could be
>> used for non-arguments conversion.
>
> That's my main concern here. We are basing a major feature on a
> side-effect of something that's declared "for internal use only".
>
> The new buffer interface isn't even supported by Unicode strings in
> Py3, so the mere existence of the defenc slot in Py3 is plainly for
> internal optimisation purposes, and the fact that it's safe for
> external code to just borrow the reference into a char* is everything
> but clear to me.
>
> It's obvious enough that defenc isn't going to go away in Py2 any
> more, but since you keep insisting, please ask on python-dev for
> making that part of the C-API publicly specified (i.e. the slot
> itself and the fact that the object in defenc is kept alive for the
> lifetime of the unicode string) before we even consider doing
> anything like this.
>
> I still don't like the list.pop() optimisation, but this is much
> worse, as we can't just take this feature back when we realise that
> it was a mistake in the first place.

This *is* a concern of mine as well (though I didn't know it was being
deprecated when I first thought of using it), and much of the proposal
is conditional on defenc, which we need, not going away. Until that's
resolved, there's no way this should go in.

If it's going away, then we can still handle argument parameters, and
char* -> object, but the spontaneous unicode -> char* suffers the
aforementioned technical difficulties and may have to be abandoned.
(An advantage would also be that we could actually support a variety
of encodings, not just the "default" one, if there's interest.)
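To sketch why argument parameters survive without defenc: the generated wrapper can encode into a temporary that it owns for exactly the duration of the call, so no borrowed reference into the unicode object is ever needed. In Python terms (the wrapper shape is illustrative only, not actual generated code):

```python
# Hypothetical model of a generated wrapper for a function taking a
# char* argument. The encoded temporary is owned by this frame, so the
# underlying buffer stays valid until the C call returns -- no defenc
# slot required.
def call_with_char_star(c_func, arg, encoding='utf-8'):
    if isinstance(arg, bytes):
        tmp = arg                    # already a byte string, use as-is
    else:
        tmp = arg.encode(encoding)   # per-call temporary
    # In real generated code, c_func would receive tmp's char* buffer;
    # tmp is not released until the call completes.
    return c_func(tmp)
```

The per-call temporary is also what would make arbitrary encodings possible, rather than just the default one.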

>
>
>> Also out there is the idea of a directive that would make char*
>> become unicode in both Py2 and Py3.
>
> ... which would likely only be useful for new code, as existing code
> would break in all sorts of places if you enable that (just as with
> type inference).

Yeah, I think this particular one is a more invasive change, and I
probably wouldn't want it tied to the first, but I was trying to
summarize.

>
>
>> On Nov 29, 2009, at 8:47 AM, Stefan Behnel wrote:
>>
>>> Robert Bradshaw, 28.11.2009 22:12:
>>>> My personal concern is the pain I see porting Sage to Py3. I'd have
>>>> to go through the codebase and throw in encodes() and decodes() and
>>>> change signatures of functions that take char* arguments
>>> That's what I figured. Instead of having to fix up the code, you
>>> want a do-what-I-mean str data type that unifies everything that's
>>> unicode, bytes and char*, and that magically handles it all for you.
>>
>> Exactly. Improve the compiler rather than change the code.
>
> You calling it 'improve' actually makes it sound better than I think
> it is.

Yeah, that colors it from my perspective. Maybe "enhance" would be a
better word.

> I do see the interest of simplifying the path between unicode
> strings and char*, but I also see an interest in making it easy for
> developers to write safe APIs that reject broken input (e.g. with 0
> bytes or other control characters). I really don't like APIs that use
> "well, it's written in C" (and certainly not "well, it's written in
> Cython"!) as an excuse for silently dropping parts of my accidentally
> broken input (which I may not even have control of myself).

Yeah, zero bytes are something that no coercion of object to char* can
handle, as a bare char* carries no length information. We have this
issue now. The SIMD/vector/memoryview types might be a way to hold
pointer + length in a single object.
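The zero-byte problem can be spelled out concretely (in Python rather than Cython, purely as a model): a bare char* behaves like strlen(), so everything after the first NUL is silently dropped, while a pointer + length pair keeps the whole buffer.

```python
def via_char_star(data):
    # Models a bare char* conversion: length is determined strlen()-style,
    # so anything after the first zero byte is silently lost.
    return data[:data.index(b'\x00')] if b'\x00' in data else data

def via_ptr_and_len(data):
    # Models a (pointer, length) pair, as a memoryview-style type would
    # carry: the full buffer survives, embedded zero bytes included.
    return (data, len(data))
```

An API built on the first form cannot even detect that it mangled its input, which is exactly the "silently dropping parts of my broken input" case above.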

> Automatic coercion to char* is only one side of input handling, and
> it may just as well lead to less helpful APIs being written. So
> enabling such a directive requires careful consideration, too,
> because it's not a simple all-win thing, not even in the long term.

Yes, I agree. Unlike type inference, backwards compatibility is not
the only significant motivation for not having this on by default.

>> I think it's easier if the Python to C and C to Python conversions
>> are uniform whether they happen via coercion, assignment, or
>> function signature constraints. Then the question is what objects
>> can be turned into a char* (the directive would add unicode) and
>> what object char* turns into (the directive would create str in Py2
>> and Py3).
>> [...]
>> If we declare
>>
>>     some_python_name = some_c_string
>>
>> to always have the same meaning as
>>
>>     some_python_name = <typeof(some_python_name)>some_c_string
>>
>> then the meaning of <bytes> some_c_string and <unicode> some_c_string
>> are clear, and <object> some_c_string is the only ambiguity, and the
>> directive would control what <object> some_c_string means.
>
> This sounds reasonable - except for the implementation details.
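Spelled out in Python (this is just an illustrative model, not generated code), the three casts would behave like this, with only the <object> case depending on the directive:

```python
# Model of the cast semantics, assuming Py3 where str is unicode and
# the encoding is always UTF-8. 'directive' stands in for the proposed
# compiler directive being on or off.
def cast_c_string(c_bytes, target, directive=False):
    if target is bytes:
        return c_bytes                   # <bytes> some_c_string
    if target is str:
        return c_bytes.decode('utf-8')   # <unicode> some_c_string
    if target is object:
        # <object> some_c_string is the only ambiguous case; the
        # directive decides between bytes (off) and str (on).
        return c_bytes.decode('utf-8') if directive else c_bytes
    raise TypeError(target)
```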
>
>
>> Function arguments typed as char* are a particularly useful case
>> though, and it would be nice to make this friendlier for Py3.
>
> Function arguments typed bytes/str/unicode are a lot easier and
> safer to handle, though, and not a bit slower in general. Coercion
> from bytes to plain char* is pretty fast, and could be even faster if
> it's typed (as we could use a None check and a macro in that case).

(As an aside, ironically, a type check is often just as fast as a None
check, as I've noticed elsewhere while statically typing things...)

>>> Now, the proposal was to enable this with a compiler directive,
>>> which would basically provide a default encoding. If this directive
>>> was used, all untyped coercions from char* to a Python object would
>>> use it. As Dag noted already, this would interfere with type
>>> inference, as the resulting type would still be char* in that case.
>>
>> This is completely orthogonal to type inference.
>
> It's not orthogonal, as type inference currently breaks C type to
> untyped Python name assignments, which is exactly the case you want
> to influence with the directive. This means that the char* directive
> would override the type inference directive for one special case.

I was just using assignment to an untyped variable as an implicit
coercion to object in my example. I should have been more explicit and
written

     cdef char* ss = ...
     cdef object x = ss

If type inference is enabled, then

     cdef char* ss = ...
     x = ss

(assuming no other assignments to x) will result in x being typed as
char*, so no coercion is triggered to do the assignment. The proposal
is to control what kind of object gets created if a char* needs to
become an object. All types are inferred and completely resolved
before any coercions get inserted.

- Robert

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
