Re: [PATCH python-tests] gnu: python-2.7: Enable UCS-4 Unicode encoding.

Danny Milosavljevic Mon, 23 Jan 2017 15:46:59 -0800

Hi Ludo,

> > Otherwise LGTM. I checked some other distros and they seem to have this
> > enabled. Thanks!  
> 
> That means that strings are internally UCS-4-encoded, right?  What’s the
> rationale, and what happens when this flag is omitted?


The CPython C interface changes depending on the flag, and some Python 
extensions don't work with the narrow UTF-16 build - which is what you get 
if you don't specify the flag.
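
A quick way to tell which kind of build an interpreter is (a minimal sketch; 
the helper name is_wide_build is made up here for illustration) is to look at 
sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF on a wide one:

```python
import sys


def is_wide_build():
    """Return True if one string element can hold any Unicode codepoint."""
    # Narrow (UTF-16) builds of Python 2 report 0xFFFF here;
    # wide (UCS-4) builds and Python 3.3+ report 0x10FFFF.
    return sys.maxunicode > 0xFFFF


print("wide build" if is_wide_build() else "narrow build")
```

An extension compiled against one kind of build is not expected to load in 
the other, so this check is mostly useful for diagnosing such failures.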

The default, UTF-16, is basically just historical baggage from the time when 
the Unicode standard had fewer than 65536 codepoints.

The highest codepoint nowadays is 1114111 (0x10FFFF).

UCS-4 encoding means that just one 32-bit word encodes one Unicode codepoint 
(it's 1:1). It's the most straightforward encoding if you don't care about size 
wastage. 

If you *do* care about size wastage, you use UTF-8.

You use UTF-16 or UCS-2 only if you are tied down by some kind of backward 
compatibility constraint (UCS-2 has no way AT ALL to encode codepoints over 
65535, whereas UTF-16 uses a variable-length encoding, surrogate pairs, to 
represent those).
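
To make the size trade-off concrete, here is a rough sketch in Python 3 that 
counts the bytes each encoding needs for a single codepoint and computes the 
UTF-16 surrogate pair by hand. The function names encoded_sizes and 
surrogate_pair are invented for this example:

```python
def encoded_sizes(ch):
    """Return the number of bytes needed to encode ch in each encoding."""
    # The "-le" variants are used so that no byte-order mark is counted.
    names = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(ch.encode(name)) for name in names}


def surrogate_pair(codepoint):
    """Split a codepoint above 0xFFFF into its two UTF-16 code units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


for ch in ("A", "\u20ac", "\U0001F600"):
    print(hex(ord(ch)), encoded_sizes(ch))
print([hex(unit) for unit in surrogate_pair(0x1F600)])
```

UTF-32 (which is what UCS-4 amounts to) always takes 4 bytes, UTF-8 takes 1 
to 4, and UTF-16 takes 2 for codepoints up to 65535 and 4 above that.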

Python builds on Microsoft Windows and Mac OS X usually use UTF-16 for 
Unicode strings, while on GNU/Linux distributions we usually use UCS-4.

Python 3 (since 3.3) does the obvious thing: it has only one string class and 
switches the internal string encoding depending on which codepoints are used. 
That way the user is none the wiser and it still saves space.
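
This can be observed from the outside, at least on CPython 3.3 or later 
(other implementations may store strings differently, so treat this as a 
sketch and not as a guarantee). The helper bytes_per_char is made up for this 
example; it measures how much a string grows per added character:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate the storage cost of one more copy of ch in a str."""
    # Comparing two lengths cancels out the fixed per-object overhead.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


for ch in ("a", "\u20ac", "\U0001F600"):
    print(hex(ord(ch)), bytes_per_char(ch))
```

On CPython I would expect the cost to go up with the highest codepoint in the 
string: the widest character decides the width used for the whole string.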

But Python 2.7 still has "strings" and "unicode strings", which are separate 
types with no such optimizations.

So this patch basically just makes sure that we do the same as other 
distributions so that all the Python 2.7 extensions work.
