> From: eryk...@gmail.com
> Date: Fri, 29 Jan 2016 07:14:07 -0600
> Subject: Re: [Tutor] lc_ctype and re.LOCALE
> To: tutor@python.org
> CC: sjeik_ap...@hotmail.com
>
> On Thu, Jan 28, 2016 at 2:23 PM, Albert-Jan Roskam
> <sjeik_ap...@hotmail.com> wrote:
> > Out of curiosity, I wrote the throw-away script below to find a
> > character that is classified (--> LC_CTYPE) as digit in one locale,
> > but not in another.
>
> The re module is the wrong tool for this. The re.LOCALE flag is only
> for byte strings, and in this case only ASCII 0-9 are matched as
> decimal digits. It doesn't call the isdigit() ctype function. Using
> Unicode with re.LOCALE is wrong.
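(The two behaviours described above can be checked with a short Python 3 sketch, not part of the original thread; the ValueError is raised by CPython 3.6 and later:)

```python
import re

# With a bytes pattern, re.LOCALE is allowed, but \d still matches only
# the ASCII digits 0-9 -- the codepage-1252 superscript digits
# \xb2, \xb3, \xb9 are ignored.
print(re.findall(rb'\d', b'0123456789\xb2\xb3\xb9', re.LOCALE))

# Since Python 3.6, combining re.LOCALE with a str (Unicode) pattern
# raises ValueError instead of silently doing the wrong thing.
try:
    re.compile(r'\d', re.LOCALE)
except ValueError as err:
    print(err)
```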
Ok, good to know. In my original Python 2 version of the script I did
convert the ordinal to a byte string, but it was still a UTF-8 byte
string.

> The current locale doesn't affect the meaning of a Unicode character.
> Starting with 3.6 doing this will raise an exception.

I find it strange that specifying either re.LOCALE or re.UNICODE is
still the "special case". IMHO it is a historical anomaly that ASCII is
the "normal case". Matching accented characters should not require any
special flags.

> The POSIX ctype functions such as isalnum and isdigit are limited to a
> single code in the range 0-255 and EOF (-1). For UTF-8, the ctype
> functions return 0 in the range 128-255 (i.e. lead bytes and trailing
> bytes aren't characters). Even if this range has valid characters in a
> given locale, it's meaningless to use a Unicode value from the Latin-1
> block, unless the locale uses Latin-1 as its codeset.
>
> Python 2's str uses the locale-aware isdigit() function. However, all
> of the locales on my Linux system use UTF-8, so I have to switch to
> Windows to demonstrate two locales that differ with respect to
> isdigit().

In other words: LC_CTYPE is only relevant with codepage encodings?

> You could use PyWin32 or ctypes to iterate over all the locales known
> to Windows, if it mattered that much to you.
>
> The English locale (codepage 1252) includes superscript digits 1, 2, and 3:
>
> >>> locale.setlocale(locale.LC_CTYPE, 'English_United Kingdom')
> 'English_United Kingdom.1252'
> >>> [chr(x) for x in range(256) if chr(x).isdigit()]
> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3', '\xb9']
> >>> unicodedata.name('\xb9'.decode('1252'))
> 'SUPERSCRIPT ONE'
> >>> unicodedata.name('\xb2'.decode('1252'))
> 'SUPERSCRIPT TWO'
> >>> unicodedata.name('\xb3'.decode('1252'))
> 'SUPERSCRIPT THREE'

Is character classification also related to the compatibility form of
Unicode normalization?
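(A quick Python 3 sketch, not from the original thread, suggests yes for these superscripts: Python 3's str.isdigit() consults Unicode character properties rather than the C runtime's locale-aware isdigit(), and NFKD compatibility decomposition maps the superscripts to plain digits:)

```python
import unicodedata

# Python 3's str.isdigit() is based on Unicode character properties,
# so these characters classify the same way in every locale.
print('\xb2'.isdigit())    # SUPERSCRIPT TWO: True
print('\xb2'.isdecimal())  # ...but it is not a decimal digit: False

# NFKD compatibility decomposition maps the superscript digits to the
# plain ASCII digits they resemble.
print(unicodedata.normalize('NFKD', '\xb2\xb3\xb9'))  # '231'
```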
>>> unicodedata.normalize("NFKD", u'\xb3')
u'3'

(see also http://unicode.org/reports/tr15/)

> Note that using the re.LOCALE flag doesn't match these superscript digits:
>
> >>> re.findall(r'\d', '0123456789\xb2\xb3\xb9', re.LOCALE)
> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Ok, now I am upset. I did not expect this at all! I would have expected
the re results to be in line with the str.isdigit results. If LC_CTYPE
is not relevant, isn't "re.DIACRITIC" a better name for the re.LOCALE
flag?

> The Windows Greek locale (codepage 1253) substitutes "Ή" for superscript 1:
>
> >>> locale.setlocale(locale.LC_CTYPE, 'Greek_Greece')
> 'Greek_Greece.1253'
> >>> [chr(x) for x in range(256) if chr(x).isdigit()]
> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3']
>
> >>> unicodedata.name('\xb9'.decode('1253'))
> 'GREEK CAPITAL LETTER ETA WITH TONOS'

Ok, I switched to Windows to see this with my own eyes. Checked the
regex. Strange, but fun to know. Thanks a lot for your thorough reply!

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor