On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" <d...@davea.name> wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
>
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
> sufficient to contain the data. I understand the complaint to be that while
> the change is great for strings that happen to fit in Latin-1, it is less
> efficient than previous versions for strings that do not.
That's not the way I interpreted PEP 393. It takes a pure Unicode string,
finds the largest code point in that string, and chooses 1, 2, or 4 bytes
for every character, based on how many bits it'd take for that largest code
point. Further, I read it to mean that only 00 bytes would be dropped in
the process; no other bytes would be changed. I take it as a coincidence
that it happens to match latin-1; that's the way Unicode happened
historically, and is not Python's fault. Am I reading it wrong?

I also figure this is going to be more space efficient than Python 3.2 for
any string with a max code point of 65535 or less (on narrow builds, as on
Windows), or 4 billion or less (in real systems). So unless French has code
points over 64k, I can't figure that anything is lost. I have no idea about
the times involved, so I wanted a more specific complaint.

> I don't know how much merit there is to this claim. It would seem to me
> that even in non-western locales, most strings are likely to be Latin-1 or
> even ASCII, e.g. class and attribute and function names.

The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that performance of some vague operations was
somehow reduced by several fold. I was just trying to get him to be more
specific.

-- 
DaveA
-- 
http://mail.python.org/mailman/listinfo/python-list
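[Ed.: the width selection being debated above can be observed directly. A
minimal sketch, assuming CPython 3.3+ where PEP 393 landed; the exact
sys.getsizeof numbers vary by version, but the per-character growth as the
max code point crosses 255 and 65535 does not. Note that French accented
letters like é (U+00E9) sit below 256, so pure French text stays in the
1-byte representation.]

```python
import sys

# PEP 393 picks 1, 2, or 4 bytes per character from the largest
# code point in the string, so one high code point widens them all.
one_byte = "naïveté"             # max code point 0xE9  -> 1 byte/char
two_byte = "naïveté\u20ac"       # euro sign: max 0x20AC -> 2 bytes/char
four_byte = "naïveté\U0001F600"  # emoji: max 0x1F600   -> 4 bytes/char

for s in (one_byte, two_byte, four_byte):
    # show the max code point and the total object size in bytes
    print(hex(max(map(ord, s))), sys.getsizeof(s))
```

Adding a single character outside the current range grows every character's
slot, which is exactly the trade-off the thread is arguing about.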