Re: Unicode in C

Meir Kriheli Tue, 13 Mar 2012 04:19:42 -0700

Hi,

2012/3/13 Elazar Leibovich <elaz...@gmail.com>


> 2012/3/13 kobi zamir <kobi.za...@gmail.com>
>
>>
>>
>>> So I guess that you're also in the UTF-8 camp.
>>>
>>
>> yes, but my opinion about utf-8 is just my opinion. i like python and
>> python defaults to utf-8.
>>
>
> Python's internal representation is not UTF-8 but UTF-16 or UTF-32,
> depending on build parameters. Thus Python doesn't really support code points
> above the BMP.
> Of course, you cannot know the internal representation, since python
> (cleverly) does not allow you to cast a unicode string to a sequence of
> bytes without specifying the result encoding.
>
> http://docs.python.org/c-api/unicode.html
>
> (see also this very good presentation
> <http://98.245.80.27/tcpc/OSCON2011/gbu.html> on internal unicode
> representations in various languages).
>
>
Nitpick: it's actually UCS-2/UCS-4 (which preceded UTF-16/UTF-32 but are
compatible with them).

Actually, one can find out the internal representation by checking
sys.maxunicode [1]: it is 0xFFFF on a narrow (UCS-2) build and 0x10FFFF on
a wide (UCS-4) build. I'm using it in python-bidi to handle surrogate
pairs manually when needed [2].
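
To illustrate, here is a rough sketch of that kind of check. It is not the
actual python-bidi code; the names IS_NARROW_BUILD and iter_code_points are
made up for this example. It assumes the usual surrogate ranges
(0xD800-0xDBFF high, 0xDC00-0xDFFF low) and merges a pair into one code
point wherever it finds one:

```python
import sys

# 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build.
# Hypothetical flag name, used here for illustration only.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def is_high_surrogate(ch):
    """True if ch is a leading (high) surrogate code unit."""
    return 0xD800 <= ord(ch) <= 0xDBFF


def is_low_surrogate(ch):
    """True if ch is a trailing (low) surrogate code unit."""
    return 0xDC00 <= ord(ch) <= 0xDFFF


def iter_code_points(text):
    """Yield integer code points, merging surrogate pairs into one value."""
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if (is_high_surrogate(ch) and i + 1 < n
                and is_low_surrogate(text[i + 1])):
            # Combine the two 10-bit halves and add the 0x10000 offset.
            high = ord(ch) - 0xD800
            low = ord(text[i + 1]) - 0xDC00
            yield 0x10000 + (high << 10) + low
            i += 2
        else:
            yield ord(ch)
            i += 1
```

On a wide build a character above the BMP is already a single item, so the
loop simply falls through to the else branch; the pair handling only kicks
in where the string really contains two surrogate code units in a row.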

[1] http://docs.python.org/dev/library/sys.html#sys.maxunicode
[2]
https://github.com/MeirKriheli/python-bidi/blob/master/src/bidi/algorithm.py#L46

Cheers
-- 
Meir

