Marc 'BlackJack' Rintsch wrote:

I don't see the shortcoming in Python <3.0.  If you want real strings
with characters instead of just a bunch of bytes simply use `unicode`
objects instead of `str`.

Fair enough -- that certainly is the best policy.  But working with any
other encoding (sometimes necessary when interfacing with any other
software), it's still a bit of a PITA.

But it has to be.  There is no automagic guessing possible.

Automagic guessing isn't necessary if strings keep track of what encoding their data is in. And why shouldn't they? We're a long way from the day when a "string" was nothing more than an array of bytes. Adding a teeny bit of metadata makes life much easier.

And does REALbasic really use byte strings plus an encoding!?

You betcha!  Works like a dream.

IMHO a strange design decision.

I get that you don't grok it, but I think that's because you haven't worked with it. RB added encoding data to its strings years ago, and changed the default string encoding to UTF-8 at about the same time, and life has been delightful since then. The only time you ever have to think about it is when you're importing a string from some unknown source (e.g. a socket), at which point you need to tell RB what encoding it is. From that point on, you can pass that string around, extract substrings, split it into words, concatenate it with other strings, etc., and it all Just Works (tm).
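
To make the idea concrete, here is a rough sketch in (modern) Python of a byte string that carries its encoding along. This is not how RB implements it; the `TaggedString` class and its methods are made up purely for illustration, and it decodes on every operation for simplicity rather than speed.

```python
class TaggedString:
    """Hypothetical byte string that remembers its own encoding."""

    def __init__(self, data, encoding):
        self.data = data          # the raw bytes, stored as received
        self.encoding = encoding  # the "teeny bit of metadata"

    def text(self):
        # Interpret the bytes as characters using the stored encoding.
        return self.data.decode(self.encoding)

    def __len__(self):
        # Length in characters, not bytes.
        return len(self.text())

    def __getitem__(self, index):
        # Index or slice by character; the result keeps the same encoding.
        piece = self.text()[index]
        return TaggedString(piece.encode(self.encoding), self.encoding)

    def split(self):
        # Split into words; every word keeps the same encoding.
        return [TaggedString(word.encode(self.encoding), self.encoding)
                for word in self.text().split()]


# The encoding is named once, where the bytes come in (e.g. from a socket).
s = TaggedString("na\u00efve caf\u00e9".encode("utf-8"), "utf-8")
```

After that single declaration, `len(s)`, `s[0:5]` and `s.split()` all work in characters, and nothing downstream has to be told the encoding again.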

In comparison, Python requires a lot more thought on the part of the programmer to keep track of what's what (unless, as you point out, you convert everything into unicode strings as soon as you get them, but that can be a very expensive operation to do on, say, a 500MB UTF-8 text file).
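
For what it's worth, the cost of the convert-everything approach can be softened by decoding a piece at a time instead of all at once. A minimal sketch, assuming the data arrives as an iterable of byte chunks; the `iter_decoded` name is mine, while `codecs.getincrementaldecoder` is the standard library's:

```python
import codecs


def iter_decoded(chunks, encoding="utf-8"):
    """Yield text pieces from an iterable of byte chunks (illustrative helper)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # The decoder buffers an incomplete multi-byte sequence until
        # the rest of it arrives in a later chunk.
        piece = decoder.decode(chunk)
        if piece:
            yield piece
    # Flush; this raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

For a big file, the chunks could come from something like `iter(lambda: f.read(65536), b"")` on a file opened in binary mode. You still pay for the decoding, but not for holding both the raw bytes and the decoded text of the whole file in memory at once.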

A lot more hassle compared to an opaque unicode string type which uses some internal encoding that makes operations like getting a character at a given index easy or concatenating without the need to reencode.

No. RB supports UCS-2 encoding, too, and is smart enough to take advantage of a fixed character width whenever a string happens to be in a fixed-width encoding. And no reencoding is done when it's not necessary (e.g., concatenating two strings of the same encoding, or adding an ASCII string to a string using any ASCII superset, such as UTF-8). There's nothing stopping you from converting all your strings to UCS-2 when you get them, if that's your preference.
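
The concatenation rule is simple enough to sketch in Python. Again, this is my own illustration of the rule described above, not RB's code: the `concat` function, the `ASCII_SUPERSETS` set (deliberately short and incomplete) and the choice to return the result in the left-hand string's encoding are all assumptions.

```python
import codecs

# Assumed, non-exhaustive set of encodings whose first 128 codes match ASCII.
ASCII_SUPERSETS = {codecs.lookup(name).name
                   for name in ("ascii", "utf-8", "latin-1")}


def _is_ascii(data):
    # True if every byte is in the 7-bit ASCII range.
    return all(byte < 0x80 for byte in data)


def concat(a, a_enc, b, b_enc):
    """Join two encoded byte strings; return (bytes, encoding name)."""
    # Normalize spellings such as "UTF8" and "utf-8" to one canonical name.
    a_enc = codecs.lookup(a_enc).name
    b_enc = codecs.lookup(b_enc).name
    if a_enc == b_enc:
        # Same encoding: just join the bytes.
        return a + b, a_enc
    if (a_enc in ASCII_SUPERSETS and b_enc in ASCII_SUPERSETS
            and _is_ascii(b)):
        # Pure ASCII bytes are already valid in any ASCII superset.
        return a + b, a_enc
    # Otherwise re-encode b into a's encoding. This raises
    # UnicodeEncodeError if a's encoding cannot represent b's characters.
    return a + b.decode(b_enc).encode(a_enc), a_enc
```

Only the last branch touches the actual characters; the first two are plain byte concatenation.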

But saying that having only one string type that knows it's Unicode, and another string type that hasn't the foggiest clue how to interpret its data as text, is somehow easier than every string knowing what it is and doing the right thing -- well, that's just silly.

Best,
- Joe

--
http://mail.python.org/mailman/listinfo/python-list
