On Mon, Jun 21, 2010 at 11:43:07AM -0400, Barry Warsaw wrote:
> On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
>
> >Something that may make sense to ease the porting process is for some
> >of these "on the boundary" I/O related string manipulation functions
> >(such as os.path.join) to grow "encoding" keyword-only arguments. The
> >recommended approach would be to provide all strings, but bytes could
> >also be accepted if an encoding was specified. (If you want to mix
> >encodings - tough, do the decoding yourself).
>
> This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz
> for it.
>
> Would it make sense to have "encoding-carrying" bytes and str types?
> Basically, I'm thinking of types (maybe even the current ones) that carry
> around a .encoding attribute so that they can be automatically encoded and
> decoded where necessary.  This at least would simplify APIs that need to do
> the conversion.
>
> By default, the .encoding attribute would be some marker to indicated "I have
> no idea, do it explicitly" and if you combine ebytes or estrs that have
> incompatible encodings, you'd either throw an exception or reset the .encoding
> to IAmConfuzzled.  But say you had an email header like:
>
> =?euc-jp?b?pc+l7aG8pe+hvKXrpcmhqg==?=
>
> And code like the following (made less crappy):
>
> -----snip snip-----
> class ebytes(bytes):
>     encoding = 'ascii'
>
>     def __str__(self):
>         s = estr(self.decode(self.encoding))
>         s.encoding = self.encoding
>         return s
>
>
> class estr(str):
>     encoding = 'ascii'
>
>
> s = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa',
>         'euc-jp')
> b = bytes(s, 'euc-jp')
>
> eb = ebytes(b)
> eb.encoding = 'euc-jp'
> es = str(eb)
> print(repr(eb), es, es.encoding)
> -----snip snip-----
>
> Running this you get:
>
> b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa' ハローワールド! euc-jp
>
> Would it be feasible?  Dunno.  Would it help ease the bytes/str confusion?
> Dunno.  But I think it would help make APIs easier to design and use because
> it would cut down on the encoding-keyword function signature infection.
>
I like the idea of having encoding information carried with the data.  I
don't think that an ebytes type that can *optionally* have an encoding
attribute makes the situation less confusing, though.  To me the biggest
problem with python-2.x's unicode/bytes handling was not that it threw
exceptions but that it didn't always throw exceptions.  You might test this
in python2::

    t = u'cafe'
    function(t)

and say, "ah, my code works".  Then a user gives it this::

    t = u'café'
    function(t)

and gets a unicode error because the function only works with unicode in the
ascii range.

ebytes seems to have the same pitfall, where the code path exercised by your
tests could work with::

    eb = ebytes(b)
    eb.encoding = 'euc-jp'
    function(eb)

but the user exercises a code path that does this and fails::

    eb = ebytes(b)
    function(eb)

What do you think of making the encoding attribute a mandatory part of
creating an ebytes object?  (ex: ``eb = ebytes(b, 'euc-jp')``).

-Toshio
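
For illustration, here is a rough sketch of what the mandatory-encoding
variant might look like in Python 3.  It is only one possible reading of the
suggestion above, not a worked-out design: the required ``encoding``
parameter on both constructors and the ``codecs.lookup`` validation are
assumptions added for this example and are not part of the code quoted
earlier.

```python
import codecs


class estr(str):
    """str that remembers which encoding it was decoded from (hypothetical)."""

    def __new__(cls, text, encoding):
        self = super().__new__(cls, text)
        self.encoding = encoding
        return self


class ebytes(bytes):
    """bytes that cannot be created without an encoding (hypothetical)."""

    def __new__(cls, data, encoding):
        # Assumption for this sketch: reject unknown encoding names early.
        # codecs.lookup raises LookupError if the name is not registered.
        codecs.lookup(encoding)
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __str__(self):
        # Decode with the attached encoding and carry it over to the result.
        return estr(self.decode(self.encoding), self.encoding)


raw = b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa'
eb = ebytes(raw, 'euc-jp')
es = str(eb)
print(repr(eb), es, es.encoding)
```

With this shape, ``ebytes(raw)`` fails with a TypeError at construction time,
so the untested code path blows up where the object is created rather than
somewhere downstream in ``function(eb)``.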