Robert Bradshaw, 12.12.2009 22:49:
> Another disadvantage of attaching the encoding to the C signature is
> that for many declarations, especially ones that could be widely
> shared (printf, fread, ...) or eventually auto-generated (from a C
> header file) it doesn't make as much sense to attach an encoding to
> the C function so much as to the module/function in which its used.

Very good point. So the encoding is actually only part of the internal
workings of a function, not of its externally visible signature. Code
that calls a function shouldn't be bothered with the implementation
details of that function, so unless it passes a byte string (in which
case the encoding *is* part of the signature), the encoding used
internally by the function should be completely transparent.

That gets us back to the idea of transparently encoding at function
call boundaries. Actually, this isn't even about function call
boundaries in general but about the Python call boundary. C functions
that take a char* will only ever accept an encoded byte string, so
there is no reason to pass them a unicode string in the first place.
And once the Python call boundary is passed, module internal code is
best served by using byte strings anyway: for passing them around
internally, for iterating over them efficiently (at least for ASCII
string content and single-byte encodings), and for passing them to C
code. Remember that, in C, char is actually an integer type, so it
won't matter much whether iteration returns an integer value or a byte
character value.

So I think the right solution is to support automatic conversion
*only* at the Python call boundary, i.e. for Python function
parameters and return values.

Now, parameters are easy as long as we stick with the bytes type, for
which "bytes[encoding='utf-8']" would be an obvious syntax in Cython.
Function return values can be made to work in the same way, simply by
allowing the same declaration for 'def' functions. And ctypedefs
would make this quite writable, as Greg suggested.

Again, this won't rescue code that was already written, but I think it
would solve the problem for future code, and existing (unicode
unaware) code could be fixed up relatively easily by replacing char*
in Python function signatures with "bytes[encoding=...]" or the
ctypedef-ed equivalent.

Comments?
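
To make the boundary idea a bit more concrete, here is a rough sketch
in plain Python rather than Cython, since "bytes[encoding=...]" is only
a proposed syntax at this point. The decorator name encoded_boundary
and the two example functions are hypothetical, invented purely for
illustration. The sketch also assumes that return values are decoded
back to unicode on the way out, which is only one possible reading of
the proposal.

```python
import functools


def encoded_boundary(encoding="utf-8"):
    """Hypothetical stand-in for a bytes[encoding=...] declaration.

    Unicode arguments are encoded once when the call enters the
    function, and a bytes result is decoded once when it leaves.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            # Encode at the Python call boundary; bytes pass through as-is.
            byte_args = [
                arg.encode(encoding) if isinstance(arg, str) else arg
                for arg in args
            ]
            result = func(*byte_args)
            # Assumption: bytes results are decoded for the caller.
            if isinstance(result, bytes):
                return result.decode(encoding)
            return result
        return wrapper
    return decorate


@encoded_boundary("utf-8")
def shout(data):
    # Inside the boundary the function only ever sees a byte string.
    assert isinstance(data, bytes)
    # bytes.upper() touches ASCII letters only; other bytes are kept.
    return data.upper()


@encoded_boundary("ascii")
def byte_values(data):
    # Iterating over bytes yields integers, much like char values in C.
    return [value for value in data]
```

The caller passes unicode and gets unicode back, without ever seeing
which encoding shout() uses internally. Only the wrapper, i.e. the
call boundary, knows about it.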
Stefan
