On Sat, Nov 16, 2013 at 2:39 AM, Robin Becker <ro...@reportlab.com> wrote:
>> Dealing with bytes and Unicode is complicated, and the 2->3 transition is
>> not easy, but let's please not spread the misunderstanding that somehow the
>> Flexible String Representation is at fault. However you store Unicode code
>> points, they are different than bytes, and it is complex having to deal with
>> both. You can't somehow make the dichotomy go away, you can only choose
>> where you want to think about it.
>>
>> --Ned.
>
> .......
> I don't think that's what I said; the flexible representation is just an
> added complexity that has come about because of the wish to store strings in
> a compact way. The requirement for such complexity is the unicode type
> itself (especially the storage requirements) which necessitated some
> remedial action.
>
> There's no point in fighting the change to using unicode. The type wasn't
> required for any technical reason as other languages didn't go this route
> and are reasonably ok, but there's no doubt the change made things more
> difficult.
There's no perceptible difference between a 3.2 wide build and the 3.3
flexible representation. (Differences with narrow builds are bugs, and have
now been fixed.) As far as your script's concerned, Python 3.3 always stores
strings in UTF-32, four bytes per character. It just happens to be way more
efficient with memory, most of the time.

Other languages _have_ gone for at least some sort of Unicode support.
Unfortunately quite a few have done a half-way job and use UTF-16 as their
internal representation. That means there's no difference between U+0012,
U+0123, and U+1234, but U+12345 suddenly gets handled differently. ECMAScript
actually specifies the perverse behaviour of treating codepoints >U+FFFF as
two elements in a string, because it's just too costly to change.

There are a small number of languages that guarantee correct Unicode
handling. I believe bash scripts get this right (though I haven't tested;
string manipulation in bash isn't nearly as rich as in a proper text parsing
language, so I don't dig into it much); Pike is a very Python-like language,
and PEP 393 made Python even more Pike-like, because Pike's string type has
been variable-width for as long as I've known it. A handful of other
languages also guarantee UTF-32 semantics.

All of them are really easy to work with; instead of writing your code and
then going "Oh, I wonder what'll happen if I give this thing weird
characters?", you just write your code, safe in the knowledge that there is
no such thing as a "weird character" (except for a few in the ASCII set...
you may find that code breaks if given a newline in the middle of something,
or maybe the slash confuses you).

Definitely don't fight the change to Unicode, because it's not a change at
all... it's just fixing what was buggy. You already had a difference between
bytes and characters, you just thought you could ignore it.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
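
To make the UTF-16-vs-UTF-32 point concrete, here's a minimal sketch. It
assumes CPython 3.3 or later; the exact sys.getsizeof() figures are
implementation details and will vary between builds, but the per-character
scaling is the point.

import sys

s = "\U00012345"   # one astral-plane codepoint (outside the BMP)

print(len(s))                           # 1 -- Python 3.3+ gives UTF-32 semantics
print(len(s.encode("utf-16-le")) // 2)  # 2 -- a UTF-16 encoding needs a surrogate pair

# PEP 393: storage width depends on the widest codepoint in the string.
print(sys.getsizeof("a" * 100))           # ~1 byte/char plus overhead (latin-1 storage)
print(sys.getsizeof("\u1234" * 100))      # ~2 bytes/char plus overhead (UCS-2 storage)
print(sys.getsizeof("\U00012345" * 100))  # ~4 bytes/char plus overhead (UCS-4 storage)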