Speaking as a user who is still confused about many implementation issues...
Greg Ewing wrote:
> Suppose we have a way of expressing a type parameterised
> with an encoding, maybe something like
>
>     encoding[name]
>
> We could have a few predefined ones, such as
>
>     ctypedef encoding['ascii'] ascii
>     ctypedef encoding['utf8'] utf8
>     ctypedef encoding['latin1'] latin1

I like this -- something like this would be really helpful. I posted a
similar note a little while back, but to repeat:

From the user's side, I am generally thinking of a given variable as
working with "text" or "data". For text, a unicode string makes sense; for
data, a bytes object makes sense. It so happens that in Py2 and C, both are
stored the same way, which is the source of all this mess.

Nevertheless, if I'm writing a Cython method, I'll know the nature of the
data I'm working with, and I'll use the above types for text-like data. In
the case of ascii, of course, the actual bytes will be the same as a bytes
object, and that will be a very common case for methods that need to pass a
char* on to C code -- it will work particularly well for things like flags
and the like: little ascii strings that are kind of like data.

Stefan Behnel wrote:
> I don't think "encoding" is a good name for a type, though. The purpose of
> names of that type is to hold data, not encodings.

How about something like "ustring"?

Am I missing something, or is this a lot like a unicode object, but with
the encoding statically defined? Kind of like a numpy array with the
datatype statically defined?

Robert Bradshaw wrote:
> Would
>
>     def flump(utf8 s):
>         return s
>
> return a bytes object?

I would expect it to return a unicode object -- in Python, I'd expect
bytes+encoding to be returned as a unicode object; it's the only way not to
lose the encoding information.

Of course, in Py2, you might expect an ascii string returned, rather than
unicode or bytes -- arrrgg! But I could live with unicode.

Stefan Behnel wrote:
> So most my-data-is-not-unicode users would want to make sure that they
> always get an easy-to-use bytes object on the way in and that the return
> value is an easy-to-use Python value, i.e. it follows the normal platform
> str type: bytes on Py2 and unicode on Py3.

But what about non-ascii encodings? Does a string make sense here? The
Python code would then need to know the encoding to do anything intelligent
with it.

Greg Ewing wrote:
> Yes, I realize it doesn't fully address your use case.
> It's more aimed at people who think a blanket declaration
> would be too implicit and error-prone.

I agree -- while ascii is such a common case that a blanket declaration
would be useful, I'd rather declare it where I need it. It seems it
wouldn't be hard to convert existing code, if what you want is ascii
everywhere.

Stefan Behnel wrote:
> To fill this with a bit of background, I started writing up a couple of
> thoughts on use cases that I think are relevant here.
>
> http://wiki.cython.org/enhancements/stringcoercion

Thanks -- I do think that's helpful.

Robert Bradshaw wrote:
> Yep, though they are related. I would imagine that most (though not
> all) of the time an API would return the same kind of string it
> expects.

Sure, though with Python's duck typing, it's pretty common for a function
to always return a given type, but accept anything that can be converted to
that type. At least in my code.
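Something like this toy helper (plain Python, Py2 spelling, and the name is
made up) is the pattern I mean -- the return type is fixed, the input type
is flexible, and any decoding is explicit:

    def as_text(obj, encoding='ascii'):
        # always hand back unicode text...
        if isinstance(obj, unicode):
            return obj
        # ...but accept bytes too, decoded with an explicitly named encoding
        return obj.decode(encoding)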
Stefan Behnel wrote:
> If that's so clear, then please answer the following: when is an encoding
> needed? Is that only when coercing between char* and Python strings, or
> also when coercing between bytes/unicode?

Certainly the latter. You can't convert from bytes to unicode without
defining an encoding -- can you?

> Will there be a different
> handling for function signatures, or will it work the same everywhere? I.e.
> will a "def func(bytes b)" function always accept unicode,

I don't think it should (aside from maybe backward compatibility -- sigh).
Again, I use bytes for data, unicode for text. Yes, an encoded string can
be data, and stored in bytes, but I would use that only explicitly.

> Or will only "def func(char*)" accept unicode input?

Actually, I think char* is really analogous to bytes, and it shouldn't
accept unicode either -- again, that's dangerous without encoding
information.

I guess the short version is that coercion between unicode and bytes (or
char*) should only be done explicitly. That could mean that you've
explicitly defined a global encoding, but I think that's too subtle --
really, I'd rather it was declared where it was used.
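To make "explicitly" concrete in plain Python terms (the byte values here
are just an example):

    data = b'r\xc3\xa9sum\xc3\xa9'   # bytes: just data, no encoding attached
    text = data.decode('utf-8')      # bytes -> unicode: you have to name the encoding
    back = text.encode('latin-1')    # unicode -> bytes: explicit there, too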
-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[email protected]