Robert Bradshaw, 12.12.2009 22:49:
> Another disadvantage of attaching the encoding to the C signature is  
> that for many declarations, especially ones that could be widely  
> shared (printf, fread, ...) or eventually auto-generated (from a C  
> header file) it doesn't make as much sense to attach an encoding to  
> the C function so much as to the module/function in which it's used.

Very good point. So the encoding is actually only part of the internal
workings of a function, not of its externally visible signature. Code that
calls a function shouldn't be bothered with the implementation details of
that function. The exception is a caller that passes a byte string, in
which case the encoding *is* part of the signature. For anything else, the
encoding used internally by the function should be completely transparent.

That gets us back to the idea of transparently encoding at function call
boundaries. Actually, this isn't even about function call boundaries but
about the Python call boundary. C functions that take a char* will always
only accept an encoded byte string no matter what, so there is no reason to
pass them a unicode string in the first place. And once the Python call
boundary is passed, module internal code is best served by using byte
strings anyway, for passing them around internally, for iterating over them
efficiently (at least for ASCII string content and single-byte encodings),
and for passing them to C code. Remember that, in C, char is actually an
integer type, so it won't matter much if iteration returns an integer value
or a byte character value.
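
To illustrate the iteration point, here is a small sketch in plain Python 3
(not Cython, and not part of any proposal). It assumes Python 3 semantics,
where iterating over a bytes object yields integers; in Python 2 the same
loop yields one-character byte strings instead. The variable names are made
up for the example.

```python
# Illustration only: plain Python 3, not Cython.
text = u"abc"
data = text.encode("ascii")      # encode once, at the boundary

# Iterating over a bytes object yields integers in Python 3,
# much like walking a char* in C.
values = [c for c in data]

# Slicing out a single byte gives a length-1 bytes object instead.
first = data[0:1]

# For single-byte encodings, one character maps to exactly one byte ...
latin1_len = len(u"\u00e9".encode("latin-1"))
# ... which is not true for multi-byte encodings such as UTF-8.
utf8_len = len(u"\u00e9".encode("utf-8"))
```

This is why byte-wise iteration is only a safe stand-in for character-wise
iteration with ASCII content or a single-byte encoding.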

So I think the right solution is to support automatic conversion *only* at
the Python call boundary, i.e. for Python function parameters and return
values.

Now, parameters are easy as long as we stick with the bytes type, for which
"bytes[encoding='utf-8']" would be an obvious syntax in Cython. Function
return values can be made to work in the same way, by simply allowing
return type declarations for 'def' functions as well. And ctypedefs would
keep this quite writable, as Greg suggested.
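
A rough way to picture the intended behaviour is the following plain Python
3 emulation. It is only a sketch: the decorator name bytes_boundary and its
parameters are hypothetical, nothing like it exists in Cython, and the
"bytes[encoding=...]" syntax above is a proposal rather than a working
feature. The point is that conversion happens once on the way in and once
on the way out, while the function body only ever sees byte strings.

```python
import functools

def bytes_boundary(encoding, encode_args=(), decode_return=False):
    """Hypothetical helper: convert text at the Python call boundary only."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            converted = list(args)
            for i in encode_args:
                # Unicode input is encoded; byte strings pass through as-is.
                if isinstance(converted[i], str):
                    converted[i] = converted[i].encode(encoding)
            result = func(*converted)
            # Optionally decode a bytes return value for the Python caller.
            if decode_return and isinstance(result, bytes):
                result = result.decode(encoding)
            return result
        return wrapper
    return decorator

@bytes_boundary("utf-8", encode_args=(0,), decode_return=True)
def shout(data):
    # Module-internal code works on byte strings only.
    assert isinstance(data, bytes)
    return data.upper()
```

Calling shout("abc") and shout(b"abc") both give "ABC" here, so the caller
never needs to know which encoding the function uses internally.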

Again, this won't rescue code that was already written, but I think it
would solve the problem for future code, and existing (unicode unaware)
code could be fixed up relatively easily by replacing char* in Python
function signatures with "bytes[encoding=...]" or the ctypedef-ed equivalent.

Comments?

Stefan