Re: encoding problem

Joe Strout Fri, 19 Dec 2008 21:19:01 -0800

Marc 'BlackJack' Rintsch wrote:

I don't see the shortcoming in Python <3.0.  If you want real strings
with characters instead of just a bunch of bytes simply use `unicode`
objects instead of `str`.
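
For illustration, here is a minimal sketch of the bytes-versus-characters distinction described above. It is written for Python 3, where `bytes` plays the role of the Python 2 `str` and `str` the role of `unicode`. The sample text is an assumption chosen for the example.

```python
# Python 3 naming: bytes ~ Python 2 str, str ~ Python 2 unicode.
# The sample text is made up; escapes keep the source file pure ASCII.
raw = "na\u00efve caf\u00e9".encode("utf-8")  # bytes, as read from a file or socket
text = raw.decode("utf-8")                    # real characters (code points)

byte_count = len(raw)    # counts bytes; the two accented letters take 2 bytes each
char_count = len(text)   # counts characters

byte_prefix = raw[:3]    # slicing bytes can cut a multi-byte character in half
char_prefix = text[:3]   # slicing text always yields whole characters
```

The byte slice ends in the middle of the third character, which is the kind of surprise the text type avoids.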

Fair enough -- that certainly is the best policy.  But working with any
other encoding (sometimes necessary when interfacing with any other
software), it's still a bit of a PITA.


But it has to be.  There is no automagic guessing possible.

Automagic guessing isn't necessary if strings keep track of what encoding their data is in. And why shouldn't they? We're a long way from the day when a "string" was nothing more than an array of bytes. Adding a teeny bit of metadata makes life much easier.

And does REALbasic really use byte strings plus an encoding!?

You betcha!  Works like a dream.


IMHO a strange design decision.

I get that you don't grok it, but I think that's because you haven't worked with it. RB added encoding data to its strings years ago, and changed the default string encoding to UTF-8 at about the same time, and life has been delightful since then. The only time you ever have to think about it is when you're importing a string from some unknown source (e.g. a socket), at which point you need to tell RB what encoding it is. From that point on, you can pass that string around, extract substrings, split it into words, concatenate it with other strings, etc., and it all Just Works (tm).

In comparison, Python requires a lot more thought on the part of the programmer to keep track of what's what (unless, as you point out, you convert everything into unicode strings as soon as you get them, but that can be a very expensive operation to do on, say, a 500MB UTF-8 text file).
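
One way to soften the cost of converting a large file up front is to decode it in chunks rather than all at once. The following is only a sketch under assumptions: the helper name `iter_decoded` and the chunk size are made up for illustration, and it uses the standard library's incremental decoder so that a multi-byte character split across two chunks is still decoded correctly.

```python
import codecs
import io

def iter_decoded(stream, encoding="utf-8", chunk_size=64 * 1024):
    """Yield decoded text pieces from a binary stream, one chunk at a time.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers any incomplete trailing byte sequence
        # and completes it when the next chunk arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# Usage: a tiny chunk size forces splits inside multi-byte characters.
sample = "caf\u00e9 " * 1000
stream = io.BytesIO(sample.encode("utf-8"))
decoded = "".join(iter_decoded(stream, chunk_size=3))
```

This keeps memory use bounded by the chunk size, though the total decoding work is the same as decoding the whole file in one call.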

A lot more hassle compared to an opaque unicode string type which uses some internal encoding that makes operations like getting a character at a given index easy or concatenating without the need to reencode.

No. RB supports UCS-2 encoding, too, and is smart enough to take advantage of the fixed character width of any encoding when that's what a string happens to be. And no reencoding is used when it's not necessary (e.g., concatenating two strings of the same encoding, or adding an ASCII string to a string using any ASCII superset, such as UTF-8). There's nothing stopping you from converting all your strings to UCS-2 when you get them, if that's your preference.
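
To make the concatenation rule concrete, here is a rough Python sketch of a byte string that carries its encoding with it. This is not how REALbasic implements it; the class `TaggedString`, its methods, the short list of ASCII supersets, and the UTF-8 fallback are all assumptions made up for illustration. It also assumes that data tagged as ASCII really contains only ASCII bytes.

```python
import codecs

# Encodings assumed (for this sketch) to be byte-compatible supersets of ASCII.
ASCII_SUPERSETS = {codecs.lookup(n).name for n in ("ascii", "utf-8", "latin-1")}
ASCII = codecs.lookup("ascii").name

class TaggedString:
    """Hypothetical sketch: raw bytes plus the name of their encoding."""

    def __init__(self, data, encoding):
        self.data = data
        self.encoding = codecs.lookup(encoding).name  # normalize aliases

    def text(self):
        return self.data.decode(self.encoding)

    def __add__(self, other):
        # Same encoding: plain byte concatenation, no re-encoding.
        if self.encoding == other.encoding:
            return TaggedString(self.data + other.data, self.encoding)
        # ASCII joined with an ASCII superset: the bytes are already valid.
        if other.encoding == ASCII and self.encoding in ASCII_SUPERSETS:
            return TaggedString(self.data + other.data, self.encoding)
        if self.encoding == ASCII and other.encoding in ASCII_SUPERSETS:
            return TaggedString(self.data + other.data, other.encoding)
        # Otherwise re-encode both sides to a common encoding (UTF-8 here).
        joined = self.text() + other.text()
        return TaggedString(joined.encode("utf-8"), "utf-8")

utf8_part = TaggedString("caf\u00e9".encode("utf-8"), "utf-8")
ascii_part = TaggedString(b" au lait", "ascii")
latin_part = TaggedString("\u00e9".encode("latin-1"), "latin-1")

cheap = utf8_part + ascii_part   # bytes are joined as-is
mixed = utf8_part + latin_part   # falls back to re-encoding
```

Only the last branch pays for a decode and encode; the first three reuse the existing bytes unchanged.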

But saying that having only one string type that knows it's Unicode, and another string type that hasn't the foggiest clue how to interpret its data as text, is somehow easier than every string knowing what it is and doing the right thing -- well, that's just silly.


Best,
- Joe

--
http://mail.python.org/mailman/listinfo/python-list
