On Sat, May 18, 2013 at 6:01 AM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
>
> East Asian languages. But later on Joel Spolsky's "standard" page about
> unicode I read that it goes to 6 bytes. That's what I implied when I
> mentioned "utf8".

Each surrogate in a UTF-16 surrogate pair carries 10 bits, for a total of 20 bits. Thus UTF-16 sets the upper bound on the number of code points at 2**20 + 2**16 (the latter term being the BMP). UTF-8 needs only 4 bytes to cover that many code points.

> A certain locale implies a certain codepage (on Windows), but where does the
> locale category LC_CTYPE fit in this story?

LC_CTYPE is the locale category that classifies characters. In Debian Linux, the English-language locales copy LC_CTYPE from the i18n (internationalization) locale:

short: http://goo.gl/Hs8RD
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/locales/i18n?view=markup

Here's the mapping between the symbolic Unicode names in the latter (e.g. <U0020>) and UTF-8:

short: http://goo.gl/cZ3dS
http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/localedata/charmaps/UTF-8?view=markup

The i18n locale is defined by the ISO/IEC technical report 14652, as an instance of an upward-compatible extension to the POSIX locale specification called the FDCC-set (i.e. Set of Formal Definitions of Cultural Conventions). Here it is in all its glory, if you like reading technical reports:

http://www.open-std.org/jtc1/sc22/wg20/docs/n972-14652ft.pdf

If that's not enough, here's the POSIX 1003.1 locale spec:

short: http://goo.gl/aOJUx
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

> Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)?

Narrow builds create UTF-16 surrogate pairs from \U literals, but these aren't treated as an atomic unit for slicing, iteration, or string length.
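
If it helps to see the surrogate arithmetic spelled out, here is a rough sketch. It assumes Python 3.3 or later, where there is no narrow/wide build distinction anymore, so the narrow-build behaviour is only imitated by counting UTF-16 code units. The helper name to_surrogate_pair is made up for illustration and is not part of any library.

```python
# Sketch only: assumes Python 3.3+ (no narrow builds), so the old narrow-build
# string length is imitated by counting UTF-16 code units.

# 20 bits from a surrogate pair, plus the 16-bit BMP.
max_code_points = 2**20 + 2**16
assert max_code_points == 0x110000  # highest code point is U+10FFFF


def to_surrogate_pair(cp):
    """Split a supplementary code point into a (high, low) UTF-16 pair.

    Hypothetical helper written for this example.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


# The highest code point still fits in 4 bytes of UTF-8.
utf8_len = len(chr(0x10FFFF).encode("utf-8"))

# U+1F600 is outside the BMP, so UTF-16 needs two code units for it.
pair = to_surrogate_pair(0x1F600)
utf16_units = len(chr(0x1F600).encode("utf-16-le")) // 2

print(utf8_len, [hex(u) for u in pair], utf16_units)
```

On a narrow build, len() of that one-character string would have reported the two code units rather than 1, which is the "not an atomic unit" point above.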