Fact: in order to work comfortably and efficiently with a character coding scheme, be it Unicode or any other, one has to take two things into account: 1) work with a unique set of characters, and 2) work with a contiguous block of code points.
At this point, it should be noticed that I have not even written about the actual coding yet, only about characters and code points. Now, let's take a look at what happens when one breaks the rules above: precisely, when one attempts to work with multiple character sets, or when one divides, artificially, the whole range of the Unicode code points into chunks.

The first consequence (and it should be quite obvious) is that you create bloated, unnecessary and useless code. I will simplify the flexible string representation (FSR) and use an "ascii" / "non-ascii" model/terminology. If you are an "ascii" user, an FSR model makes no sense: an "ascii" user will, by definition, use only "ascii" characters. If you are a "non-ascii" user, the FSR model is also nonsense, because you are, by definition, a "non-ascii" user of "non-ascii" characters; any optimisation for the "ascii" user becomes irrelevant. In a sense, to escape from this, you would have to be at the same time a non-"ascii" user and a non-"non-ascii" user. Impossible. In both cases, an FSR model is useless, and in both cases you are forced to use bloated and unnecessary code. The rule is to treat every character of the unique character set of a coding scheme in, how to say, an "equal way". The problem can be seen the other way around: every coding scheme has been built to work with a unique set of characters, otherwise it would not work properly!

The second negative aspect of this splitting is the splitting itself. One can optimize every subset of characters; one will always be impacted by the "switch" between the subsets. This is one more reason to work with a unique set of characters, or rather, this is the reason why every coding scheme handles a unique set of characters.

Up to now, I have spoken only about the characters and the sets of characters, not about the coding of the characters. There is a point which is quite hard to understand and also hard to explain; it becomes obvious with some experience.
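The cost of that "switch" between subsets can be observed directly in CPython 3.3+, where the FSR stores each string in the narrowest representation that fits its widest character. A minimal sketch with sys.getsizeof (exact byte counts vary by platform and interpreter version; the characters chosen here are just illustrative):

```python
import sys

# Under the FSR (PEP 393), a string uses 1, 2 or 4 bytes per character,
# depending on the widest code point it contains.
ascii_s = 'a' * 1000           # latin-1 range: 1 byte per character
bmp_s   = '\u20ac' * 1000      # EURO SIGN: forces 2 bytes per character
astral  = '\U00010348' * 1000  # GOTHIC LETTER HWAIR: forces 4 bytes per character

for s in (ascii_s, bmp_s, astral):
    print(len(s), sys.getsizeof(s))

# A single wide character is enough to widen the whole string:
mixed = 'a' * 999 + '\u20ac'
print(sys.getsizeof(mixed) > sys.getsizeof(ascii_s))  # True
```

All four strings have the same length in characters, yet three different internal representations; mixing one "non-ascii" character into an "ascii" string switches the whole string to the wider form.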
When one works with a coding scheme, one always has to think in characters / code points. If one takes the perspective of encoded code points, it simply does not work, or may not work very well (memory/speed). The whole problem is that it is impossible to work with characters directly; one is forced to manipulate encoded code points as if they were characters. Unicode is built and thought to work with code points, not with encoded code points. The serialization, the transformation code point -> encoded code point, is "only" a technical and secondary process.

Surprise: all the Unicode coding schemes (utf-8, 16, 32) work with the same set of characters. They differ in the serialization, but they all work with a unique set of characters. The utf-16 / ucs-2 pair is an interesting case: their encoding mechanisms are quasi the same; the difference lies in the sets of characters.

There is another way to understand the problem empirically: the historical evolution of character coding. Practically all coding schemes have been created to handle different sets of characters, or rather, coding schemes have been created because that is the only way to work properly. If it had been possible to work with multiple coding schemes, I'm pretty sure a solution would have emerged. It never happened, and otherwise it would not have been necessary to create iso10646 or Unicode, nor to create all these codings iso-8859-***, cp***, mac***, which are all *based on sets of characters*. plan9 attempted to work with multiple character sets; it did not work very well, the main issue being the switch between the codings.

A solution à la FSR cannot work, or cannot work in an optimized way. It is not a coding scheme; it is a composite of coding schemes handling several character sets. Hard to imagine something worse. Contrary to what has been said, the bad cases I presented here are not corner cases. There is practically and systematically a regression in Py33 compared to Py32.
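The point about one character set with several serializations takes only a few lines to show. A sketch (the -le codec variants are used here only to keep the BOM out of the byte counts):

```python
# The same character (code point) exists in every Unicode coding scheme;
# only its serialization into bytes differs.
c = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE, code point U+00E9

utf8  = c.encode('utf-8')      # b'\xc3\xa9'          : 2 bytes
utf16 = c.encode('utf-16-le')  # b'\xe9\x00'          : 2 bytes
utf32 = c.encode('utf-32-le')  # b'\xe9\x00\x00\x00'  : 4 bytes

# Every serialization decodes back to the same code point.
assert utf8.decode('utf-8') == utf16.decode('utf-16-le') == utf32.decode('utf-32-le') == c
print(utf8, utf16, utf32)
```

Three different byte sequences, one character: the serialization is the technical and secondary process, the code point is the invariant.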
That's very easy to test. I did all my tests in the light of what I explained above, and this bad behaviour was no surprise to me. Python is not my tool. If I may give a piece of advice, a scientific approach: I suggest the core devs first spend their time proving that an FSR model can beat the existing models (purely at the C level), and then, if they succeed, implement it. My feeling is that most people are defending this FSR simply because it exists, not because of its intrinsic quality. Hint: I suggest the experts take a comprehensive look at the cmap table of the OpenType fonts (pure Unicode technology). Those people know how to work. I would be very happy to be wrong. Unfortunately, I'm afraid that's not the case.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list