Hi Ludo,

> > Otherwise LGTM. I checked some other distros and they seem to have this
> > enabled. Thanks!
>
> That means that strings are internally UCS-4-encoded, right? What’s the
> rationale, and what happens when this flag is omitted?

The CPython C interface changes depending on this flag, and some Python extensions don't work with the narrow UTF-16 Unicode build, which is what you get if you omit the flag.

The default, UTF-16, is basically just historical baggage from when the Unicode standard had fewer than 65536 codepoints. The highest codepoint nowadays is 1114111 (0x10FFFF).

UCS-4 encoding means that one 32-bit word encodes exactly one Unicode codepoint (it's 1:1). It's the most straightforward encoding if you don't care about wasted space. If you *do* care about wasted space, you use UTF-8. You use UTF-16 or UCS-2 only if you are tied down by some kind of backward-compatibility constraint (UCS-2 has no way AT ALL to encode codepoints over 65535, whereas UTF-16 uses a variable-length encoding to represent them).

Python builds on Microsoft Windows and Mac OS X usually use UTF-16, while on GNU/Linux distributions we usually use UCS-4.

Python 3 does the obvious thing: it has only one string class and (since 3.3) switches the internal string encoding depending on which codepoints are used. That way the user is none the wiser and it still saves space. But Python 2.7 still has "strings" and "unicode strings", which are distinct types with no such optimization.

So this patch basically just makes sure that we do the same as other distributions so that all the Python 2.7 extensions work.
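To make the size trade-off concrete, here is a small sketch that compares how many bytes one codepoint takes in each encoding and checks which kind of build the interpreter is. The helper names encoded_sizes and is_wide_build are made up for illustration; sys.maxunicode is the standard attribute that reports the largest codepoint the build can store in a single string element.

```python
import sys


def encoded_sizes(ch):
    """Return the number of bytes needed to encode `ch` in each encoding."""
    # The "-le" variants are used so that no byte order mark is prepended,
    # which would otherwise inflate the byte counts.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


def is_wide_build():
    """True if one string element can hold any Unicode codepoint."""
    # Narrow (UTF-16) builds of Python 2 report 65535 here;
    # wide (UCS-4) builds and Python 3.3+ report 1114111.
    return sys.maxunicode > 0xFFFF


# An ASCII letter: 1 byte in UTF-8, 2 in UTF-16, 4 in UTF-32 (UCS-4).
ascii_sizes = encoded_sizes(u"a")

# U+1F600 lies above 65535: UTF-16 needs two 16-bit units for it,
# UTF-8 needs four bytes, and UTF-32 still uses a single 32-bit word.
astral_sizes = encoded_sizes(u"\U0001F600")

print(ascii_sizes)
print(astral_sizes)
print(is_wide_build())
```

On a narrow Python 2.7 build, len(u"\U0001F600") is 2 rather than 1, because the character is stored as a surrogate pair. That kind of difference is what can trip up code written against the other build.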