At 10:57 AM 6/20/2010 -0700, Guido van Rossum wrote:
> The problem comes exactly where you find it: when *porting* existing
> code that uses aforementioned ways to alleviate the pain, you find
> that the hacks no longer work and a properly layered design is needed
> that clearly distinguishes between which variables contain bytes and
> which text.

Actually, I would say it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet we do not wish to constantly convert back and forth to full-blown Unicode, especially since the protocols themselves designate ASCII or latin-1 at the transport layer (sometimes with odder encodings above, but existing code already has to deal with those explicitly).

While reading over this thread, I'm wondering whether at least my (WSGI-related) problems in this area would be solved by a type (say, "bstr") that is simply a wrapper providing string-like behavior over an underlying bytes, bytearray, or memoryview object, and that produces objects of a compatible type when combined with strings (by encoding the strings to match).

Then, I could wrap bytes with it to pass them to string operations, and then feed them back into everything else. The bstr type ideally would be directly compatible with bytes I/O, or at least have a .bytes attribute that would be.

It seems like that would reduce WSGI porting issues quite a bit, since it would mostly consist of throwing extra bstr() calls in where things are breaking, and maybe grabbing the .bytes attribute for I/O.

This approach would still be explicit as to what types you're working with, but would not require O(n) *conversions* at every interaction boundary. It would be limited, of course, to single-byte encodings in which all 256 byte values (0-255) are valid.
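
As a rough illustration of the wrapper described above, here is a minimal sketch. Everything in it is an assumption made for illustration: the class body, the _coerce helper, the choice of methods, and the default of latin-1 are not an existing API. It also normalizes its input to bytes, which copies bytearray and memoryview data, so it does not by itself deliver the no-conversion property discussed above.

```python
class bstr:
    """Hypothetical string-like wrapper over bytes (illustrative sketch only)."""

    def __init__(self, data, encoding="latin-1"):
        self.encoding = encoding
        # Simplification: bytearray/memoryview inputs are copied into bytes here;
        # a real wrapper would keep a reference to avoid the O(n) copy.
        self.bytes = self._coerce(data)

    def _coerce(self, other):
        """Return `other` as bytes, encoding str operands to match."""
        if isinstance(other, bstr):
            return other.bytes
        if isinstance(other, str):
            return other.encode(self.encoding)
        if isinstance(other, (bytes, bytearray, memoryview)):
            return bytes(other)
        raise TypeError("cannot combine bstr with %s" % type(other).__name__)

    def __add__(self, other):
        return bstr(self.bytes + self._coerce(other), self.encoding)

    def __radd__(self, other):
        # Reached for "text" + bstr(...), because str.__add__ declines non-str operands.
        return bstr(self._coerce(other) + self.bytes, self.encoding)

    def startswith(self, prefix):
        return self.bytes.startswith(self._coerce(prefix))

    def split(self, sep=None, maxsplit=-1):
        raw_sep = None if sep is None else self._coerce(sep)
        return [bstr(part, self.encoding)
                for part in self.bytes.split(raw_sep, maxsplit)]

    def __eq__(self, other):
        try:
            return self.bytes == self._coerce(other)
        except TypeError:
            return NotImplemented

    def __hash__(self):
        return hash(self.bytes)

    def __str__(self):
        return self.bytes.decode(self.encoding)

    def __repr__(self):
        return "bstr(%r)" % (self.bytes,)


# Combining with plain strings yields bstr objects again.
header = bstr(b"Content-Type") + ": " + "text/html"
payload = header.bytes  # plain bytes, ready for bytes I/O
```

The .bytes attribute plays the role suggested above: string operations go through the wrapper, and I/O takes the underlying bytes.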

OTOH, maybe there should just be a bytestrings module with bytestrings.ascii and bytestrings.latin1, and between the two that should cover the network protocol needs quite well.

Actually, if the Python 3 str() constructor could do O(1) conversion for the latin-1 case (i.e., by just wrapping the underlying bytes), I would just put "bstr = lambda x: str(x, 'latin-1')" at the top of my programs and have roughly the same effect.
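
A small sketch of why latin-1 in particular suits this trick: it maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding is lossless, whereas ASCII rejects bytes above 127. The names all_bytes, code_points, round_tripped, and decodes_as_ascii are made up for this illustration. Note that the decode shown here still copies the data; the O(1) behavior remains the hypothetical part.

```python
# The one-liner from the paragraph above.
bstr = lambda x: str(x, 'latin-1')

all_bytes = bytes(range(256))          # every possible byte value
text = bstr(all_bytes)                 # decode as latin-1
code_points = [ord(ch) for ch in text]
round_tripped = text.encode('latin-1') # back to bytes


def decodes_as_ascii(data):
    """Return True if `data` is valid ASCII, False otherwise."""
    try:
        str(data, 'ascii')
    except UnicodeDecodeError:
        return False
    return True
```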

This idea is still a bit half-baked, but a more baked version might be just the ticket for porting stuff that used str to work with bytes in 2.x, if only because writing, e.g.:

     newurl = bstr(urljoin(bstr(base), 'subdir'))

seems so much saner than writing *this* everywhere:

     newurl = str(urljoin(str(base, 'latin-1'), 'subdir'), 'latin-1')
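
For reference, a runnable sketch of the explicit round trip with today's str and bytes types, assuming base arrives as bytes and the result must be bytes again. The helpers to_text and to_bytes and the example URL are hypothetical. The way back is written with .encode() because calling str() with an encoding argument on an object that is already a str raises TypeError.

```python
from urllib.parse import urljoin


def to_text(data):
    """Decode protocol bytes as latin-1 (lossless for every byte value)."""
    return data.decode('latin-1')


def to_bytes(text):
    """Encode text back to bytes for I/O."""
    return text.encode('latin-1')


base = b'http://example.com/app/'   # hypothetical input from the wire
newurl = to_bytes(urljoin(to_text(base), 'subdir'))
```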

It is perhaps a bit late to propose this idea, since ideally we would also want to use it in 2.x to aid porting. But I'm curious whether any other people here experiencing byte/unicode woes in relation to network protocols would find this a solution to their chief frustration (i.e., that the stdlib now often insists on strings where bytes were effectively usable before, and thus one must do conversions both coming and going).
