Re: [Python-Dev] bytes / unicode

Stephen J. Turnbull Tue, 22 Jun 2010 23:52:01 -0700

Ian Bicking writes:

 > Just for perspective, I don't know if I've ever wanted to deal with a URL
 > like that.


Also just for perspective: I do, many times a day, for Japanese media
sites and Wikipedia.

 > I know how it is supposed to work, and I know what a browser does
 > with that, but so many tools will clean that URL up *or* won't be
 > able to deal with it at all that it's not something I'll be passing
 > around.

I'm not suggesting that is something you want to be "passing around";
it's a presentation form, and I prefer that the internal form use
Unicode.

 > While it's nice to be correct about encodings, sometimes it is
 > impractical.  And it is far nicer to avoid the situation entirely.

But you cannot avoid it entirely.  Processing bytes means you are
assuming ASCII compatibility.  Granted, this is a pretty good
assumption, especially if you got the bytes off the wire, but it's not
universally so.
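
To make the ASCII-compatibility point concrete, here is a minimal
sketch.  The example URL and the choice of UTF-16 as the
non-ASCII-compatible encoding are my own assumptions for illustration;
nothing in this thread says such bytes are common in practice.

```python
# The same URL text in an ASCII-compatible and a non-ASCII-compatible encoding.
url = "http://example.com/path"      # hypothetical example URL
utf8 = url.encode("utf-8")
utf16 = url.encode("utf-16-le")

# Searching for the ASCII bytes "://" works when the encoding is
# ASCII-compatible...
scheme_utf8, sep_utf8, rest_utf8 = utf8.partition(b"://")

# ...but finds nothing in UTF-16, where every character occupies two bytes,
# so the separator never appears as three consecutive bytes.
scheme_utf16, sep_utf16, rest_utf16 = utf16.partition(b"://")

print(scheme_utf8, sep_utf8)    # scheme found
print(sep_utf16)                # empty: the bytes-level search failed
```

The bytes-level code is not wrong; it simply carries an unstated
assumption about the encoding of its input.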

Maybe it's a YAGNI, but one reason I prefer the decode-process-encode
paradigm is that choice of codec is a specification of the assumptions
you're making about encoding.  So the Know-Nothing codec described
above assumes just enough ASCII compatibility to parse the scheme.
You could also have codecs which assume just enough ASCII
compatibility to parse a hierarchical scheme, etc.
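
As a rough sketch of what "just enough ASCII compatibility to parse the
scheme" could look like in code: the helper names split_scheme and
join_scheme below are hypothetical, and this is not the codec described
earlier in the thread, only one way to express the same assumption.
The scheme is decoded to str; everything after the colon stays as
uninterpreted bytes.

```python
import re

# Scheme syntax per RFC 3986: a letter, then letters, digits, "+", "-" or ".",
# terminated by ":".  Only this prefix is assumed to be ASCII.
_SCHEME = re.compile(rb"([A-Za-z][A-Za-z0-9+.-]*):")

def split_scheme(raw):
    """Hypothetical helper: return (scheme as str, remainder as bytes)."""
    m = _SCHEME.match(raw)
    if m is None:
        raise ValueError("no ASCII scheme found")
    # Decode only the part we claim to understand.
    scheme = m.group(1).decode("ascii").lower()
    return scheme, raw[m.end():]

def join_scheme(scheme, rest):
    """Hypothetical inverse: re-encode the scheme, pass the rest through."""
    return scheme.encode("ascii") + b":" + rest

# The remainder may contain arbitrary bytes; they are never decoded.
scheme, rest = split_scheme(b"HTTP://\xe4\xbe\x8b.jp/\xff")
wire = join_scheme(scheme, rest)
```

Because the remainder is never decoded, it cannot raise a decoding
error, which is the sense in which the choice of codec documents the
assumptions being made.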

 > That is, decoding content you don't care about isn't just
 > inefficient, it's complicated and can introduce errors.

That depends on the codec(s) used.

 > Similarly I'd expect (from experience) that a programmer using
 > Python to want to take the same approach, sticking with unencoded
 > data in nearly all situations.

Indeed, a programmer using Python 2 would want to do so, because all
her literal strings are bytes by default (i.e., if she doesn't mark them
with `u'), and interactive input is, too.  This is no longer so
obvious in Python 3, which takes the attitude that things that are
expected to be human-readable should be processed as str.  The obvious
example in URI space is the file:/// URL, which you'll typically build
up from a user string or a file browser, which will call the os.path
stuff which returns str.
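
A small sketch of the file: case, staying in str until the last moment.
The directory and file name are made up for illustration, and I use
posixpath rather than os.path only so that the result is the same on
every platform.

```python
import posixpath
from urllib.parse import quote, unquote

# Path components arrive as str, e.g. from user input or a file browser.
directory = "/home/user"             # hypothetical directory
filename = "日本語.txt"               # hypothetical user-supplied name
path = posixpath.join(directory, filename)

# Percent-encode the path (quote() uses UTF-8 and keeps "/" by default).
uri = "file://" + quote(path)

# Only at the boundary, e.g. when putting the URI on the wire, is it
# turned into bytes; by then it is pure ASCII.
wire = uri.encode("ascii")

print(uri)
print(unquote(uri))   # readable presentation form, useful for debugging
```

Every step before the final encode works on str, which is the form the
os.path functions hand back in Python 3 when given str arguments.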

Text editors and viewers will also use str for their buffers, and if
they provide a way to fish out URIs for their users, they'll probably
return str.

I won't pretend to judge the relative importance of such use cases.
But use cases for urllib which naturally favor str until you put the
URI on the wire do exist, as does the debugging presentation aspect.
