On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote:
> On Sunday, August 19, 2012 12:26:44 UTC+2, Chris Angelico wrote:
>> On Sun, Aug 19, 2012 at 8:19 PM, <wxjmfa...@gmail.com> wrote:
>>> This is precisely the weak point of this flexible
>>> representation. It uses latin-1, and latin-1 is simply
>>> unusable for most users.
>>
>> No, it uses Unicode, and as an optimization it attempts to store the
>> code points in fewer than four bytes for most strings. The fact that
>> the one-byte storage format happens to look like latin-1 is rather
>> coincidental.
>>
> And this is the common basic mistake. You do not push your
> argumentation far enough. A character may accidentally "fall" into
> latin-1. The problem lies in the European characters which cannot
> fall into this coding. This *is* the cause of the negative side
> effects.
> If you are using a correct coding scheme, like cp1252, mac-roman or
> iso-8859-15, you will never see such a negative side effect.
> Again, the problem is not the result, the encoded character. The
> critical part is the character which may cause this side effect.
> You should think "character set" rather than encoded "code point",
> insofar as that kind of expression makes sense in an 8-bit coding
> scheme.
>
> jmf
But that choice was made decades ago, when Unicode picked its second
128 characters to match latin-1. The internal form used in this PEP is
simply the low-order byte of the Unicode code point. Scanning the
string to decide whether it could be converted to cp1252 (for example)
would be a much more expensive operation than simply finding how many
bytes the largest code point needs.

-- 
DaveA
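
To make that concrete, here is a rough sketch in Python of the decision
being described (an illustration of the idea only, not CPython's actual
C implementation; the function name and the example strings are mine):
one pass over the string finds the largest code point, which determines
whether 1, 2 or 4 bytes per character are needed, and in the 1-byte case
the stored byte is just the low-order byte of the code point, which for
code points below 256 happens to coincide with latin-1.

def storage_width(s):
    """Bytes per character a PEP-393-style string would need."""
    # One cheap pass: only the largest code point matters.
    biggest = max(map(ord, s), default=0)
    if biggest < 0x100:
        return 1      # every code point fits in one byte
    elif biggest < 0x10000:
        return 2      # BMP only, two bytes per character
    else:
        return 4      # astral characters present, four bytes each

for text in ("hello", "h\xe9llo", "h\u20acllo", "h\U0001F600llo"):
    # ascii() keeps the output safe on non-UTF-8 terminals
    print(ascii(text), "->", storage_width(text), "byte(s) per character")

# The 1-byte form stores the low-order byte of each code point,
# which is byte-for-byte identical to latin-1 for U+0000..U+00FF:
s = "h\xe9llo"
assert bytes(ord(c) & 0xff for c in s) == s.encode("latin-1")

By contrast, testing whether a string happens to round-trip through
cp1252 would need a per-character table lookup (cp1252 leaves some of
0x80-0x9F unassigned and maps the rest to scattered code points), and
the answer still wouldn't give you a fixed element size the way the
max-code-point scan does.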