Re: [Python-Dev] bytes / unicode

Stephen J. Turnbull Mon, 21 Jun 2010 09:14:47 -0700

Lennart Regebro writes:

 > 2010/6/21 Stephen J. Turnbull <[email protected]>:
 > > IMO, the UI is right.  "Something" like the above "ought" to work.
 > 
 > Right. That said, many times when you want to do urlparse etc they
 > might be binary, and you might want binary. So maybe the methods
 > should work with both?


First, a caveat: I'm a Unicode/encodings person, not an experienced
web programmer.  My opinions on whether this would work well in
practice should be taken with a grain of salt.

Speaking for myself, I live in a country where the natives have
saddled themselves with no less than 4 encodings in common use, and I
would never want "binary" since none of them would display as anything
useful in a traceback.  Wherever possible, I decode "blobs" into
structured objects, I do it as soon as possible, and if for efficiency
reasons I want to do this lazily, I store the blob in a separate
.raw_object attribute.  If they're textual, I decode them to text.  I
can't see an efficiency argument for decoding URIs lazily in most
applications.

In the case of structured text like URIs, I would create a separate
class for handling them with string-like operations.  Internally, all
text would be raw Unicode (ie, not url-encoded); repr(uri) would use
some kind of readable quoting convention (not url-encoding) to
disambiguate random reserved characters from separators, while
str(uri) would produce an url-encoded string.  Converting to and from
wire format is just .encode and .decode, then, and in this country you
need to be flexible about which encoding you use.

Agreed, this stuff is really annoying.  But I think that just comes
with the territory.  PJE reports that folks don't like doing encoding
and decoding all over the place.  I understand that, but if they're
doing a lot of that, I have to wonder why.  Why not define the one
line function and get on with life?

The thing is, where I live, it's not going to be a one line function.
I'm going to be dealing with URLs that are url-encoded representations
of UTF-8, Shift-JIS, EUC-JP, and occasionally RFC 2047!  So I need an
API that explicitly encodes and decodes.  And I need an API that
presents Japanese as Japanese rather than as line noise.

Eg, PJE writes

    Ugh.  I meant: 

    newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')

    Which just goes to the point of how ridiculous it is to have to  
    convert things to strings and back again to use APIs that ought to  
    just handle bytes properly in the first place. 

But if you need that "everywhere", what's so hard about

def urljoin_wrapper (base, subdir):
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

Now, note how that pattern fails as soon as you want to use
non-ISO-8859-1 languages for subdir names.  In Python 3, the code
above is just plain buggy, IMHO.  The original author probably will
never need the generalization.  But her name will be cursed unto the
nth generation by people who use her code on a different continent.

The net result is that bytes are *not* a programmer- or user-friendly
way to do this, except for the minority of the world for whom Latin-1
is a good approximation to their daily-use unibyte encoding (eg, it's
probably usable for debugging in Dansk, but you won't win any
popularity contests in Tel Aviv or Shanghai).

_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] bytes / unicode

Reply via email to