On Sat, 18 Feb 2006 09:59:38 +0100, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:

>Aahz wrote:
>> The problem is that they don't understand that "Martin v. Löwis" is not
>> Unicode -- once all strings are Unicode, this is guaranteed to work.
Well, after all the "string" literal escapes that were being used
to define byte values are all rewritten, yes, I'll believe the guarantee ;-)
(BTW, are there plans for migration tools?)

Ok, now back to the s/bytes/octet/ topic:
>
>This specific call, yes. I don't think the problem will go away as long
>as both encode and decode are available for both strings and byte
>arrays.
>
>> While it's not absolutely true, my experience of watching Unicode
>> confusion is that the simplest approach for newbies is: encode FROM
>> Unicode, decode TO Unicode.
>
>I think this is what should be in-grained into the library, also. It
>shouldn't try to give additional meaning to these terms.
>
Thinking about bytes recently, it occurs to me that bytes are really not
intrinsically numeric in nature. They don't necessarily represent uint8's.
E.g., a binary file is really a sequence of bit octets in its most
primitive and abstract sense.

So I'm wondering if we shouldn't have an octet type analogous to unicode,
and instances of octet would be vectors of octets as abstract 8-bit bit
vectors, like instances of unicode are vectors of abstract characters.

If you wanted integers you could map ord for integers guaranteed to be in
range(256). The constructor would naturally take any suitable integer
sequence so octet([65,66,67]) would work.
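
A minimal sketch of that constructor and the ord mapping, assuming a
hypothetical Octet class. The names Octet and octet_ord are illustrative
only; no such type exists in the standard library. Indexing deliberately
yields a length-one Octet rather than an integer.

```python
class Octet:
    """Hypothetical immutable sequence of abstract 8-bit values."""

    def __init__(self, values=()):
        data = tuple(values)
        for v in data:
            if not isinstance(v, int) or not 0 <= v < 256:
                raise ValueError("octet values must be integers in range(256)")
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # Indexing yields an Octet (length one for an int index), never a bare int.
        if isinstance(index, slice):
            return Octet(self._data[index])
        return Octet((self._data[index],))

    def __eq__(self, other):
        return isinstance(other, Octet) and self._data == other._data

    def __hash__(self):
        return hash(self._data)


def octet_ord(o):
    """Return the unsigned integer value of a length-one Octet (assumed helper)."""
    if len(o) != 1:
        raise TypeError("octet_ord() expected length one, got length %d" % len(o))
    return o._data[0]


oseq = Octet([65, 66, 67])
# Integers are obtained only by explicitly mapping the ord-like function.
ints = [octet_ord(o) for o in oseq]
```

Here ints is a plain list of integers in range(256), while oseq itself
never exposes numbers unless asked.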

In general, all encode methods would produce an octet instance, e.g.
unicode.encode. octet.decode(octet_instance, 'src_encoding') or
octet_instance.decode('src_encoding') would do all the familiar character
code sequence decoding, e.g., octet.decode(oseq, 'utf-8') or
oseq.decode('utf-8') to make a unicode instance.

Going from unicode, unicode.encode(uinst, 'utf-8') or uinst.encode('utf-8')
would produce an octet instance. I think this is conceptually purer than
the current bytes idea, since the result really has no arithmetic
significance.
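
As a rough illustration of the encode/decode pairing, Python 3's str and
bytes types (which postdate this message) can stand in for unicode and the
proposed octet type. This is an analogy under that assumption, not the
proposal itself: Python 3 bytes differ in that indexing them yields
integers.

```python
# str plays the role of unicode; bytes plays the role of octet.
text = "Martin v. L\u00f6wis"

encoded = text.encode("utf-8")     # characters -> octets
decoded = encoded.decode("utf-8")  # octets -> characters

# Each type offers only one direction, so the terms cannot be mixed up.
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")

# The one non-ASCII character takes two octets in UTF-8.
lengths = (len(text), len(encoded))
```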

Also, ord would work on a length-one octet instance, and produce the
unsigned integer value you'd expect, but would fail if not length-one,
like ord on unicode (or current str).

Thus octet would replace bytes as the binary info container, and would not
have any presumed arithmetic significance, either as integer or as
character-of-current-source-encoding-inferred-from-integer-value-as-ord.

To get a text representation of octets, hex is natural, e.g.,

    octet('6162 6380') # spaces ignored

so repr(octet('a deaf bee')) => "octet('adeafbee')" and
octet('616263').decode('ascii') => u'abc' and
back: u'abc'.encode('ascii') => octet('616263'). The base64 codec looks
conceptually cleaner too, so long as you keep in mind base64 as a character
subset of unicode and the name of the transformation function pair.

    octet('616263').decode('base64') => u'YWJj\n' # octets -> characters
    u'YWJj\n'.encode('base64') => octet('616263') # characters -> octets
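
The hex notation above can be approximated with Python 3 bytes standing in
for the proposed octet type. The helper names octets_from_hex and
octets_repr are hypothetical. Note that the standard library names the
base64 directions the other way round (octets to base64 text is called
encoding there), so this sketch only mirrors the data flow, not the
proposed method names.

```python
import base64


def octets_from_hex(text):
    """Build octets from hex digits, ignoring all whitespace (assumed helper)."""
    # Whitespace is removed first so digits may be grouped arbitrarily.
    return bytes.fromhex("".join(text.split()))


def octets_repr(data):
    """Render octets in the proposed octet('...') hex form (assumed helper)."""
    return "octet('%s')" % data.hex()


shown = octets_repr(octets_from_hex("a deaf bee"))
abc_text = octets_from_hex("616263").decode("ascii")
abc_octets = "abc".encode("ascii")

# octets -> base64 characters, and back again
b64_text = base64.encodebytes(octets_from_hex("616263")).decode("ascii")
round_trip = base64.decodebytes(b64_text.encode("ascii"))
```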

If you wanted integer-nature bytes, you could have octet codecs for uint8
and int8, e.g., octseq.decode('int8') could produce a list of signed
integers all in range(-128,128). Or maybe map(dec_int8, octseq). The array
module could easily be a target for octet.decode, e.g.,
octseq.decode('array_B') or octet.decode(octseq, 'array_B'),
and octet(array_instance) the other way.
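
For comparison, the same conversions can be sketched with today's standard
library, using Python 3 bytes as a stand-in for the octet sequence. The
'int8' and 'array_B' codec names above are part of the proposal and do not
exist; struct and array are used here instead.

```python
import array
import struct

octseq = bytes([0, 1, 127, 128, 255])

# uint8 view: each octet read as an unsigned integer in range(256)
uint8_values = list(octseq)

# int8 view: the same octets read as signed integers in range(-128, 128)
int8_values = list(struct.unpack("%db" % len(octseq), octseq))

# array('B') as a decode target, and octets again from the array
arr = array.array("B", octseq)
back = arr.tobytes()
```

The octets themselves are unchanged in all three cases; only the chosen
interpretation gives them numeric meaning.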

Likewise, other types could be destinations for octet.decode.

E.g., if you had an abstraction for a display image one could have 'gif'
and 'png' and 'bmp' etc be like 'cp437', 'latin-1', and 'utf-8' etc are for
decoding octets to unicode, and write stuff like

    o_seq = open('pic.gif','rb').read()  # makes octet instance
    # => img is abstract, internally represented suitably but hidden,
    #    like unicode.
    img = o_seq.decode('gif89')
    open('pic.png', 'wb').write(img.encode('png'))

UIAM PIL has this functionality, if not as encode/decode methods.

Similarly, there could be an abstract archive container, and you could have

    # => might do lazy things waiting for encode
    arch = open('tree.tgz','rb').read().decode('tgz')
    # convert to egg format?? (just hand-waving ;-)
    egg_octets = arch.encode('python_egg')

Probably all it would take is to wrap some things in abstract-container
(AC) types, to enforce the protocol. Image(octet_seq, 'gif') might produce
an AC that only saved a (octet_seq, 'gif') internally, or it might do eager
conversion per optional additional args. Certainly .bmp without rle can be
hugely wasteful.

For flexibility like eager vs not, or perhaps returning an iterator instead
of a byte sequence, I guess the encode/decode signatures should be
(enc, *args, **kw) and pass those things on to the worker functions? An
abstract container could have a "pack" codec to do serial
composition/decomposition.
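
One way the AC protocol might look, as a sketch under stated assumptions:
the class AbstractContainer, its register method, the eager flag, and the
toy 'be16'/'le16' codecs are all invented for illustration. Python 3 bytes
stand in for the octet type.

```python
import struct


class AbstractContainer:
    """Hypothetical AC: keeps (octets, source encoding) and decodes lazily."""

    _decoders = {}
    _encoders = {}

    def __init__(self, octets, src_enc, eager=False, **kw):
        self._octets = bytes(octets)
        self._src_enc = src_enc
        self._kw = kw            # extra options passed on to the decoder
        self._value = None
        self._decoded = False
        if eager:
            self._materialize()

    @classmethod
    def register(cls, name, decoder, encoder):
        """Associate worker functions with an encoding name."""
        cls._decoders[name] = decoder
        cls._encoders[name] = encoder

    def _materialize(self):
        # Decode at most once, on first need.
        if not self._decoded:
            self._value = self._decoders[self._src_enc](self._octets, **self._kw)
            self._decoded = True
        return self._value

    def encode(self, enc, *args, **kw):
        # Same format requested and nothing decoded yet: return stored octets.
        if enc == self._src_enc and not self._decoded and not args and not kw:
            return self._octets
        return self._encoders[enc](self._materialize(), *args, **kw)


# Toy codecs: sequences of unsigned 16-bit integers, big- and little-endian.
AbstractContainer.register(
    "be16",
    lambda o: list(struct.unpack(">%dH" % (len(o) // 2), o)),
    lambda v: struct.pack(">%dH" % len(v), *v),
)
AbstractContainer.register(
    "le16",
    lambda o: list(struct.unpack("<%dH" % (len(o) // 2), o)),
    lambda v: struct.pack("<%dH" % len(v), *v),
)

ac = AbstractContainer(b"\x00\x01\x00\x02", "be16")
same = ac.encode("be16")      # no decoding needed
swapped = ac.encode("le16")   # decodes now, then re-encodes
```

The lazy path means a container that is only ever written back in its
source format never pays for conversion.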

I'm sure Mal has all this stuff one way or another, but I wanted the
conceptual purity of AC instances ac in
ac = octet_seq.decode('src_enc'); octet_seq = ac.encode('dst_enc') ;-)

Bottom line thought: binary octets aren't numeric ;-)

Regards,
Bengt Richter

