On Jun 21, 2010, at 01:17 PM, P.J. Eby wrote:

>I'm not really sure how much use the encoding is on a unicode object - what
>would it actually mean?
>
>Hm. I suppose it would effectively mean "this string can be represented in
>this encoding" -- which is useful, in that you could fail operations when
>combining with bytes of a different encoding.

That's basically what I was thinking.

>Hm... no, in that case you should just encode the string to the bytes'
>encoding, and let that throw an error if it fails. So, really, there's no
>reason for a string to know its encoding. All you need is the bytes type to
>have an encoding attribute, and when doing mixed-type operations between
>bytes and strings, coerce to *bytes of the same encoding*.

If ebytes were a separate type, and it did the encoding check at constructor
time, and the results of the decoding were cached, then I think you would not
need the equivalent of an estr type. If you had a string and knew what it
could be encoded to, then you could just coerce it to an ebytes and use the
cached decoded value wherever you needed it. E.g.

    >>> mystring = 'some unicode string'
    >>> myencoding = 'iso-9999-foo'
    >>> myebytes = ebytes(mystring, myencoding)
    >>> myebytes.encoding == myencoding
    True
    >>> myebytes.string == mystring
    True

So ebytes() could accept a str or bytes as its first argument.

    >>> mybytes = b'some encoded string'
    >>> myebytes = ebytes(mybytes, myencoding)
    >>> mybytes == myebytes
    True
    >>> myebytes.encoding == myencoding
    True

In the first example ebytes() encodes mystring to set the internal bytes
representation. In the second example, ebytes() decodes the bytes to get the
.string attribute value. In both cases, an exception is raised if the
encoding/decoding fails.

>However, if .encoding is None, then coercion would follow the same rules as
>now -- i.e., convert the bytes to unicode, assuming an ascii encoding. (This
>would be different than setting an encoding of 'ascii', because in that case,
>it means you want cross-type operations to result in ascii bytes, rather than
>a unicode string, and to fail if the unicode part can't be encoded
>appropriately. The 'None' setting is effectively a nod to compatibility with
>prior 3.x versions, since I assume we can't just throw out the old coercion
>behavior.)
>
>Then, a few more changes to the bytes type would round out the implementation:
>
>* Allow .decode() to not specify an encoding, unless .encoding is None
>
>* Add back in the missing string methods (e.g. .encode()), since you can
>transparently upgrade to a string)
>
>* Smart __str__, as shown in your proposal.

If my example above isn't nonsense, then __str__() would just return the
.string attribute.

>In short, +1. (I wish it were possible to go back and make bytes non-strings
>and have only this ebytes or bstr or whatever type have string methods, but
>I'm pretty sure that ship has already sailed.)

Maybe it's PEP time? No, I'm not volunteering. ;)

-Barry
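
For illustration only, here is a minimal sketch of how the ebytes type
discussed in this message might look as a bytes subclass. Nothing like it
exists in the standard library: the class body, the __add__ coercion (one
reading of "coerce to bytes of the same encoding"), and the no-argument
.decode() are all assumptions. The real 'latin-1' codec stands in for the
placeholder 'iso-9999-foo', and the .encoding = None compatibility case is
left out.

```python
class ebytes(bytes):
    """Hypothetical bytes subclass that remembers its encoding.

    The encode/decode check runs once, at construction time, and the
    decoded text is cached on the .string attribute.
    """

    def __new__(cls, value, encoding):
        if isinstance(value, str):
            # str input: encode it; raises UnicodeEncodeError on failure.
            raw = value.encode(encoding)
            text = value
        else:
            # bytes input: decode it; raises UnicodeDecodeError on failure.
            raw = bytes(value)
            text = raw.decode(encoding)
        self = super().__new__(cls, raw)
        self.encoding = encoding
        self.string = text
        return self

    def __str__(self):
        # Return the cached decoded value.
        return self.string

    def decode(self, encoding=None, errors="strict"):
        # With no explicit encoding, reuse the cached decoded value.
        if encoding is None:
            return self.string
        return super().decode(encoding, errors)

    def __add__(self, other):
        # Assumed coercion rule: a str operand is encoded to this object's
        # encoding, and the result is again an ebytes of that encoding.
        if isinstance(other, str):
            other = other.encode(self.encoding)
        return ebytes(super().__add__(other), self.encoding)


myebytes = ebytes('some unicode string', 'latin-1')
print(myebytes.encoding, str(myebytes))
print(ebytes(b'some encoded string', 'latin-1') == b'some encoded string')
```

Subclassing bytes keeps ordinary comparisons working, so mybytes == myebytes
holds as in the examples above, at the cost of storing both the encoded and
the decoded forms.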