On Dec 12, 2009, at 11:35 PM, Stefan Behnel wrote:
> Robert Bradshaw, 12.12.2009 22:49:
>> Another disadvantage of attaching the encoding to the C signature is
>> that for many declarations, especially ones that could be widely
>> shared (printf, fread, ...) or eventually auto-generated (from a C
>> header file), it doesn't make as much sense to attach an encoding to
>> the C function so much as to the module/function in which it's used.
>
> Very good point. So it's actually only part of the internal workings of a
> function, not the externally visible signature. Code that calls a function
> shouldn't be bothered with the implementation details of that function, so
> if it wants to pass anything other than a byte string (in which case the
> encoding *is* part of the signature), the encoding used internally by the
> function should be completely transparent.
>
> That gets us back to the idea of transparently encoding at function call
> boundaries. Actually, this isn't even about function call boundaries but
> about the Python call boundary. C functions that take a char* will always
> only accept an encoded byte string no matter what, so there is no reason
> to pass them a unicode string in the first place. And once the Python call
> boundary is passed, module internal code is best served by using byte
> strings anyway, for passing them around internally, for iterating over
> them efficiently (at least for ASCII string content and single-byte
> encodings), and for passing them to C code. Remember that, in C, char is
> actually an integer type, so it won't matter much if iteration returns an
> integer value or a byte character value.
>
> So I think the right solution is to support automatic conversion *only*
> at the Python call boundary, i.e. for Python function parameters and
> return values.
I disagree. Most of the examples here have been very simple, but in
general the Python/C boundary need not be cleanly aligned with the
Python call boundary. Some more general examples would be:
cdef extern from "foo.h":
    void cblarg(int i, char* name)

def blarg(obj):
    # I realize I'm assuming name is not a dynamically generated attribute...
    cblarg(obj.id, obj.name)

or even

def blarg_all(list L):
    for i, a in enumerate(L):
        cblarg(i, a)
Of course, this boundary is an important one, and when passing in
arguments there's currently no way to implicitly or explicitly call
encode().
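A minimal sketch of the hand-written conversion that fills this gap
today, assuming UTF-8 and reusing the hypothetical cblarg declaration
from the example above:

def blarg(obj):
    name = obj.name
    if isinstance(name, unicode):
        # explicit, per-call-site encoding; the signature can't ask for it
        name = name.encode('utf-8')
    cblarg(obj.id, name)  # byte string -> char* coercion is automatic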
> Now, parameters are easy as long as we stick with the bytes type, for
> which "bytes[encoding='utf-8']" would be an obvious syntax in Cython.
> Function return values can be made to work in the same way, by simply
> allowing their declaration also for 'def' functions. And ctypedefs would
> make this quite writeable, as Greg suggested.
>
> Again, this won't rescue code that was already written, but I think it
> would solve the problem for future code, and existing (unicode unaware)
> code could be fixed up relatively easily by replacing char* in Python
> function signatures with "bytes[encoding=...]" or the ctypedef-ed
> equivalent.
>
> Comments?
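For illustration, the proposal quoted above (which is not valid Cython
as it stands) might read something like the following; utf8_string is
just a ctypedef name I'm assuming for the example:

# proposed syntax only -- not something current Cython accepts
ctypedef bytes[encoding='utf-8'] utf8_string

def blarg(int i, utf8_string name):
    # under the proposal, a unicode argument would be encoded to UTF-8
    # automatically at the Python call boundary
    cblarg(i, name)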
I'm all for making string encodings easier to use, though as I've said
encode() and decode() seem to be a clean enough solution for nearly
everything but argument parsing.
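To make the encode()/decode() point concrete, here is a minimal sketch
for the return direction, assuming UTF-8 and a hypothetical C function
c_get_name() that returns a char*:

cdef extern from "foo.h":
    char* c_get_name()             # hypothetical C function

def get_name():
    cdef bytes raw = c_get_name()  # char* -> bytes copy
    return raw.decode('utf-8')     # explicit, one-line decode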
However (and maybe this belongs on the other thread), you are
completely skirting the issue of being able to declare the encoding
for a block of code in one place, rather than having to specify it
every single place it is used. I initially thought your concern with
char* <-> unicode conversion was the ambiguity in what character set
to use, which I was proposing could be declared at a higher level than
case by case. Is there another reason it is vital that the encoding
step and/or parameters be reiterated at every instance they are used?
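One way to approximate "declare it once" today, without any new syntax,
is a module-level helper; a rough sketch, with names of my own choosing:

# the single place that states the encoding this module uses for C strings
_C_ENCODING = 'utf-8'

cdef bytes _to_c_bytes(s):
    # encode unicode here; pass byte strings through unchanged
    if isinstance(s, unicode):
        return s.encode(_C_ENCODING)
    return s

Of course, every call site still has to remember to route strings
through the helper, which is exactly why a real block-level declaration
would be nicer.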
- Robert