Speaking as a user who is still confused about many implementation issues...
Greg Ewing wrote:
> Suppose we have a way of expressing a type parameterised
> with an encoding, maybe something like
>
>     encoding[name]
>
> We could have a few predefined ones, such as
>
>     ctypedef encoding['ascii'] ascii
>     ctypedef encoding['utf8'] utf8
>     ctypedef encoding['latin1'] latin1

I like this -- something like this would be really helpful. I posted a
similar note a little while back, but to repeat:

From the user's side, I am generally thinking of a given variable as
working with "text" or "data". For text, a unicode string makes sense; for
data, a bytes object makes sense. It so happens that in Py2 and C, both are
stored the same way, which is the source of all this mess.

Nevertheless, if I'm writing a Cython method, I'll know the nature of the
data I'm working with, and I'll use the above types for text-like data. In
the case of ascii, of course, the actual bytes will be the same as a bytes
object, and that will be a very common case for methods that need to pass a
char* on to C code -- it will work particularly well for things like flags
and the like: little ascii strings that are kind of like data.

Stefan Behnel wrote:
> I don't think "encoding" is a good name for a type, though. The purpose of
> names of that type is to hold data, not encodings.

How about something like "ustring"?

Am I missing something, or is this a lot like a unicode object, but with
the encoding statically defined? Kind of like a numpy array with the
datatype statically defined?

Robert Bradshaw wrote:
> Would
>
>     def flump(utf8 s):
>         return s
>
> return a bytes object?

I would expect it to return a unicode object -- in Python, I'd expect
bytes+encoding to be returned as a unicode object; it's the only way not to
lose the encoding information.

Of course, in Py2, you might expect an ascii string returned, rather than
unicode or bytes -- arrrgg! But I could live with unicode.

Stefan Behnel wrote:
> So most my-data-is-not-unicode users would want to make sure that they
> always get an easy-to-use bytes object on the way in and that the return
> value is an easy-to-use Python value, i.e. it follows the normal platform
> str type: bytes on Py2 and unicode on Py3.

But what about non-ascii encodings? Does a string make sense here? The
Python code would then need to know the encoding to do anything intelligent
with it.

Greg Ewing wrote:
> Yes, I realize it doesn't fully address your use case.
> It's more aimed at people who think a blanket declaration
> would be too implicit and error-prone.

I agree -- while ascii is such a common case that a blanket declaration
would be useful, I'd rather declare it where I need it. It seems it
wouldn't be hard to convert existing code, if what you want is ascii
everywhere.

Stefan Behnel wrote:
> To fill this with a bit of background, I started writing up a couple of
> thoughts on use cases that I think are relevant here.
>
> http://wiki.cython.org/enhancements/stringcoercion

Thanks -- I do think that's helpful.

Robert Bradshaw wrote:
> Yep, though they are related. I would imagine that most (though not
> all) of the time an API would return the same kind of string it
> expects.

Sure, though with Python's duck typing, it's pretty common for a function
to always return a given type, but accept anything that can be converted to
that type. At least in my code.
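Something like this toy helper (plain Python, Py2 spelling, and the name is
made up) is the pattern I mean -- the return type is fixed, the input type
is flexible, and any decoding is explicit:

    def as_text(obj, encoding='ascii'):
        # always hand back unicode text...
        if isinstance(obj, unicode):
            return obj
        # ...but accept bytes too, decoded with an explicitly named encoding
        return obj.decode(encoding)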
Stefan Behnel wrote:
> If that's so clear, then please answer the following: when is an encoding
> needed? Is that only when coercing between char* and Python strings, or
> also when coercing between bytes/unicode?

Certainly the latter. You can't convert from bytes to unicode without
defining an encoding -- can you?

> Will there be a different
> handling for function signatures, or will it work the same everywhere? I.e.
> will a "def func(bytes b)" function always accept unicode,

I don't think it should (aside from maybe backward compatibility -- sigh).
Again, I use bytes for data, unicode for text. Yes, an encoded string can
be data, and stored in bytes, but I would use that only explicitly.

> Or will only "def func(char*)" accept unicode input?

Actually, I think char* is really analogous to bytes, and it shouldn't
accept unicode either -- again, that's dangerous without encoding
information.

I guess the short version is that coercion between unicode and bytes (or
char*) should only be done explicitly. That could mean that you've
explicitly defined a global encoding, but I think that's too subtle --
really, I'd rather it was declared where it was used.
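To make "explicitly" concrete in plain Python terms (the byte values here
are just an example):

    data = b'r\xc3\xa9sum\xc3\xa9'   # bytes: just data, no encoding attached
    text = data.decode('utf-8')      # bytes -> unicode: you have to name the encoding
    back = text.encode('latin-1')    # unicode -> bytes: explicit there, too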
-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[email protected]