On Sat, 18 Feb 2006 09:59:38 +0100, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>Aahz wrote:
>> The problem is that they don't understand that "Martin v. Löwis" is not
>> Unicode -- once all strings are Unicode, this is guaranteed to work.

Well, after all the "string" literal escapes that were being used to define
byte values are all rewritten, yes, I'll believe the guarantee ;-)
(BTW, are there plans for migration tools?)

Ok, now back to the s/bytes/octet/ topic:
>
>This specific call, yes. I don't think the problem will go away as long
>as both encode and decode are available for both strings and byte
>arrays.
>
>> While it's not absolutely true, my experience of watching Unicode
>> confusion is that the simplest approach for newbies is: encode FROM
>> Unicode, decode TO Unicode.
>
>I think this is what should be in-grained into the library, also. It
>shouldn't try to give additional meaning to these terms.
>

Thinking about bytes recently, it occurs to me that bytes are really not
intrinsically numeric in nature. They don't necessarily represent uint8's.
E.g., a binary file is really a sequence of bit octets in its most primitive
and abstract sense.

So I'm wondering if we shouldn't have an octet type analogous to unicode,
and instances of octet would be vectors of octets as abstract 8-bit bit
vectors, like instances of unicode are vectors of abstract characters.

If you wanted integers you could map ord for integers guaranteed to be in
range(256). The constructor would naturally take any suitable integer
sequence so octet([65,66,67]) would work.

In general, all encode methods would produce an octet instance, e.g.
unicode.encode.

    octet.decode(octet_instance, 'src_encoding') or
    octet_instance.decode('src_encoding')

would do all the familiar character code sequence decoding, e.g.,

    octet.decode(oseq, 'utf-8') or oseq.decode('utf-8')

to make a unicode instance. Going from unicode,

    unicode.encode(uinst, 'utf-8') or uinst.encode('utf-8')

would produce an octet instance.
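A rough sketch of how such an octet type might behave, written in present-day Python 3. Everything named here is hypothetical and invented for illustration: the `Octet` class, its `ordinal()` method (standing in for `ord`), and the module-level `encode()` helper (standing in for `unicode.encode`). The built-in `bytes` is used only as hidden storage; nothing here is an existing or proposed API.

```python
class Octet:
    """Hypothetical immutable vector of abstract 8-bit values (illustration only)."""

    def __init__(self, values=()):
        # Accept any sequence of integers; bytes() raises ValueError for
        # values outside range(256) and TypeError for non-integers.
        self._data = bytes(list(values))

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        # Iterating yields length-one Octet instances, not bare integers.
        for i in range(len(self._data)):
            yield Octet(self._data[i:i + 1])

    def __getitem__(self, index):
        # Slices and single indexes both return Octet instances.
        if isinstance(index, slice):
            return Octet(self._data[index])
        return Octet([self._data[index]])

    def __eq__(self, other):
        return isinstance(other, Octet) and self._data == other._data

    def __hash__(self):
        return hash(self._data)

    def __repr__(self):
        return "Octet(%r)" % list(self._data)

    def ordinal(self):
        """Stand-in for ord(): integer value of a length-one instance."""
        if len(self._data) != 1:
            raise TypeError("expected an Octet of length 1, got length %d"
                            % len(self._data))
        return self._data[0]

    def decode(self, encoding):
        """Octets -> text, using the standard text codecs."""
        return self._data.decode(encoding)


def encode(text, encoding):
    """Stand-in for unicode.encode: text -> Octet."""
    return Octet(text.encode(encoding))


oseq = Octet([65, 66, 67])
print(oseq.decode("utf-8"))               # ABC
print(list(map(Octet.ordinal, oseq)))     # [65, 66, 67]
print(encode("abc", "utf-8"))             # Octet([97, 98, 99])
```

The one design point the sketch tries to capture is that integers only appear when explicitly asked for, via `ordinal()` on a length-one instance; indexing and iteration stay within the octet type.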
I think this is conceptually purer than the current bytes idea, since the
result really has no arithmetic significance. Also, ord would work on a
length-one octet instance, and produce the unsigned integer value you'd
expect, but would fail if not length-one, like ord on unicode (or current
str).

Thus octet would replace bytes as the binary info container, and would not
have any presumed arithmetic significance, either as integer or as
character-of-current-source-encoding-inferred-from-integer-value-as-ord.

To get a text representation of octets, hex is natural, e.g.,

    octet('6162 6380') # spaces ignored

so

    repr(octet('a deaf bee')) => "octet('adeafbee')"

and

    octet('616263').decode('ascii') => u'abc'

and back:

    u'abc'.encode('ascii') => octet('616263')

The base64 codec looks conceptually cleaner too, so long as you keep in mind
base64 as a character subset of unicode and the name of the transformation
function pair.

    octet('616263').decode('base64') => u'YWJj\n' # octets -> characters
    u'YWJj\n'.encode('base64') => octet('616263') # characters -> octets

If you wanted integer-nature bytes, you could have octet codecs for uint8
and int8, e.g., octseq.decode('int8') could produce a list of signed
integers all in range(-128,128). Or maybe map(dec_int8, octseq).

The array module could easily be a target for octet.decode, e.g.,

    octseq.decode('array_B') or octet.decode(octseq, 'array_B')

and octet(array_instance) the other way.

Likewise, other types could be destination for octet.decode. E.g., if you
had an abstraction for a display image one could have 'gif' and 'png' and
'bmp' etc be like 'cp437', 'latin-1', and 'utf-8' etc are for decoding
octets to unicode, and write stuff like

    o_seq = open('pic.gif','rb') # makes octet instance
    img = o_seq.decode('gif89') # => img is abstract, internally represented suitably but hidden, like unicode.
    open('pic.png', 'wb').write(img.encode('png'))

UIAM PIL has this functionality, if not as encode/decode methods.
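The hex text form and the "decode to other targets" idea can be approximated with today's standard library. This is only a sketch under assumptions: `octets_from_hex`, `hex_repr`, `DECODERS`, and `decode_octets` are hypothetical names made up here, plain `bytes` stands in for the octet type, and the 'base64' entry follows the naming used above (octets -> characters), which is the direction the `base64` module calls encoding.

```python
import array
import base64
import struct


def octets_from_hex(text):
    """Parse hex digits into bytes, ignoring all whitespace."""
    # Whitespace is removed first so that 'a deaf bee' pairs up as 'adeafbee'.
    return bytes.fromhex("".join(text.split()))


def hex_repr(data):
    """Text representation in the octet('...') style used above."""
    return "octet('%s')" % data.hex()


# Hypothetical registry of non-text decode targets, keyed by codec name.
DECODERS = {
    "uint8": lambda data: list(data),
    "int8": lambda data: list(struct.unpack("%db" % len(data), data)),
    "array_B": lambda data: array.array("B", data),
    "base64": lambda data: base64.encodebytes(data).decode("ascii"),
}


def decode_octets(data, target):
    """Dispatch to a registered decoder, else fall back to a text codec."""
    worker = DECODERS.get(target)
    if worker is not None:
        return worker(data)
    return data.decode(target)


print(hex_repr(octets_from_hex("a deaf bee")))              # octet('adeafbee')
print(decode_octets(octets_from_hex("616263"), "ascii"))    # abc
print(decode_octets(octets_from_hex("6162 6380"), "int8"))  # [97, 98, 99, -128]
```

Image or archive targets such as 'gif' or 'tgz' would be further entries in the same kind of registry; they are left out because they need third-party or much larger code.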

Similarly, there could be an abstract archive container, and you could have

    arch = open('tree.tgz','rb').decode('tgz') # => might do lazy things waiting for encode
    egg_octets = arch.encode('python_egg') # convert to egg format?? (just hand-waving ;-)

Probably all it would take is to wrap some things in abstract-container (AC)
types, to enforce the protocol. Image(octet_seq, 'gif') might produce an AC
that only saved a (octet_seq, 'gif') internally, or it might do eager
conversion per optional additional args. Certainly .bmp without rle can be
hugely wasteful.

For flexibility like eager vs not, or perhaps returning an iterator instead
of a byte sequence, I guess the encode/decode signatures should be
(enc, *args, **kw) and pass those things on to the worker functions?

An abstract container could have a "pack" codec to do serial
composition/decomposition.

I'm sure Mal has all this stuff one way or another, but I wanted the
conceptual purity of AC instances ac in

    ac = octet_seq.decode('src_enc'); octet_seq = ac.encode('dst_enc')

;-)

Bottom line thought: binary octets aren't numeric ;-)

Regards,
Bengt Richter

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev