Re: sre is broken in SuSE 9.2
Denis S. Otkidach wrote:
> You are right. But isalpha behavior looks strange for me anyway: why
> cyrillic character '\u0430' is recognized as alpha one for de_DE locale,
> but is not for C?

In glibc, all "real" locales are based on /usr/share/locale/i18n/locales/i18n, e.g. for de_DE through

LC_CTYPE
copy "i18n"

i18n includes U+0430 as a character, through

lower /
...
% TABLE 11 CYRILLIC/
..;..(2)..;/

This makes U+0430 a letter in all locales including i18n (unless locally overridden). This entire approach apparently is based on ISO 14652, which, in section 4.3.3, introduces the "i18n" LC_CTYPE category.

Why the C locale does not use i18n, I don't know. Most likely, the intention is that the "C" locale works without any additional data files - you should ask the glibc developers.

OTOH, there is a definition file POSIX for what appears to be the POSIX locale. I'd like to point out that this implementation is potentially in violation of ISO 14652; annex A.2.2 says that the notion of a POSIX locale is replaced with the i18n FDCC-set. So accordingly, I would expect that i18n is used in POSIX as well - see for yourself that it isn't in glibc 2.3.2. Again, I suggest asking the glibc developers why this is so.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
Re: sre is broken in SuSE 9.2
Serge Orlov wrote:
> Emphasis is mine. So how many libc implementations with non-unicode
> wide-character codes do we have in 2005?

Solaris has supported 2-byte wchar_t implementations for many years, and so I believe did HP-UX and AIX.

ISO C99 defines a constant __STDC_ISO_10646__ which an implementation can use to indicate that wchar_t uses Unicode (aka ISO 10646) in all locales. Very few implementations define this constant at this time, though.

Regards,
Martin
Re: sre is broken in SuSE 9.2
On Fri, 11 Feb 2005 18:49:53 +0100
"Fredrik Lundh" <[EMAIL PROTECTED]> wrote:

> >> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> >> [u'\xb5\xba\xe4\u0430']
> >
> > I can't find the strict definition of isalpha, but I believe average
> > C program shouldn't care about the current locale alphabet, so isalpha
> > is a union of all supported characters in all alphabets
>
> btw, what does isalpha have to do with this example?

The same problem is with isalpha. In most distributions:

>>> for c in u'\xb5\xba\xe4\u0430': print c.isalpha(),
...
True True True True

And in SuSE 9.2:

>>> for c in u'\xb5\xba\xe4\u0430': print c.isalpha(),
...
False False False False

--
Denis S. Otkidach
http://www.python.ru/ [ru]
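The per-character check above can be repeated on a current interpreter. The following is only a sketch, written in Python 3 syntax because the Python 2 print statement and u'' prefix used in the thread no longer apply; the names SAMPLE and isalpha_report are made up for illustration. It pairs each character with its Unicode name so that it is clear which letters are being tested.

```python
import unicodedata

# The four characters from the thread: micro sign, masculine ordinal
# indicator, a with diaeresis, Cyrillic small a.
SAMPLE = '\xb5\xba\xe4\u0430'

def isalpha_report(text):
    """Return (Unicode name, isalpha result) for every character in text."""
    return [(unicodedata.name(c), c.isalpha()) for c in text]

for name, flag in isalpha_report(SAMPLE):
    print(name, flag)
```

On an interpreter that consults its own Unicode database, every flag is expected to be True, which corresponds to the "most distributions" output quoted above.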
Re: sre is broken in SuSE 9.2
Serge Orlov wrote:
> The wide-character value for each member of the Portable
> Character Set will equal its value when used as the lone character
> in an integer character constant. Wide-character codes for other
> characters are locale- and *implementation-dependent*
>
> Emphasis is mine.

the relevant part for this thread is *locale-*. if wctype depends on the locale, it cannot be used for a generic build. (custom interpreters are another thing, but they shouldn't be shipped as "python").
Re: sre is broken in SuSE 9.2
Fredrik Lundh wrote:
> Serge Orlov wrote:
>
>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
>> [u'\xb5\xba\xe4\u0430']
>>
>> I can't find the strict definition of isalpha, but I believe average
>> C program shouldn't care about the current locale alphabet, so
>> isalpha is a union of all supported characters in all alphabets
>
> nope. isalpha() depends on the locale, as does all other ctype
> functions (this also applies to wctype, on some platforms).

I mean "all supported characters in all alphabets [in the current locale]". For example, in ru_RU.koi8-r isalpha should return true for characters in the English and Russian alphabets. In ru_RU.koi8-u -- for characters in the English, Russian and Ukrainian alphabets, in ru_RU.utf-8 -- for all alphabetic characters in unicode that the implementation supports.

IMHO iswalpha in the POSIX locale can return true for all alphabetic characters in unicode instead of being limited to the English alphabet.

Serge.
Re: sre is broken in SuSE 9.2
Fredrik Lundh wrote:
> Serge Orlov wrote:
>
>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
>> [u'\xb5\xba\xe4\u0430']
>>
>> I can't find the strict definition of isalpha, but I believe average
>> C program shouldn't care about the current locale alphabet, so
>> isalpha is a union of all supported characters in all alphabets
>
> btw, what does isalpha have to do with this example?

It has to do with this thread. u'\xe4'.isalpha() returns false in Suse. It's in the same boat as \w

Serge.
Re: sre is broken in SuSE 9.2
"Martin v. Löwis" wrote: > Serge Orlov wrote: > > To summarize the discussion: either it's a bug in glibc or there > is an >> option to specify modern POSIX locale. POSIX locale consist of >> characters from the portable character set, unicode is certainly >> portable. > > Yes, but U+00E4 is not in the portable character set. The portable > character set is defined here: > > http://www.opengroup.org/onlinepubs/007908799/xbd/charset.html Thanks for the link. They write (in 1997 or earlier ?): The wide-character value for each member of the Portable Character Set will equal its value when used as the lone character in an integer character constant. Wide-character codes for other characters are locale- and *implementation-dependent* Emphasis is mine. So how many libc implementations with non-unicode wide-character codes do we have in 2005? I'm really interested to know. Serge. -- http://mail.python.org/mailman/listinfo/python-list
Re: sre is broken in SuSE 9.2
Serge Orlov wrote:
> To summarize the discussion: either it's a bug in glibc or there is an
> option to specify modern POSIX locale. POSIX locale consist of
> characters from the portable character set, unicode is certainly
> portable.

Yes, but U+00E4 is not in the portable character set. The portable character set is defined here:

http://www.opengroup.org/onlinepubs/007908799/xbd/charset.html

Regards,
Martin
Re: sre is broken in SuSE 9.2
On Sat, 12 Feb 2005 09:42:41 +0100
"Fredrik Lundh" <[EMAIL PROTECTED]> wrote:

> the relevant part for this thread is *locale-*. if wctype depends on
> the locale, it cannot be used for generic build. (custom interpreters
> are another thing, but they shouldn't be shipped as "python").

You are right. But the isalpha behavior still looks strange to me: why is the cyrillic character '\u0430' recognized as alphabetic in the de_DE locale, but not in C?

--
Denis S. Otkidach
http://www.python.ru/ [ru]
Re: sre is broken in SuSE 9.2
Serge Orlov wrote:
>> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
>> [u'\xb5\xba\xe4\u0430']
>
> I can't find the strict definition of isalpha, but I believe average
> C program shouldn't care about the current locale alphabet, so isalpha
> is a union of all supported characters in all alphabets

nope. isalpha() depends on the locale, as does all other ctype functions (this also applies to wctype, on some platforms).
Re: sre is broken in SuSE 9.2
Serge Orlov wrote:
>> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
>> [u'\xb5\xba\xe4\u0430']
>
> I can't find the strict definition of isalpha, but I believe average
> C program shouldn't care about the current locale alphabet, so isalpha
> is a union of all supported characters in all alphabets

btw, what does isalpha have to do with this example?
Re: sre is broken in SuSE 9.2
Denis S. Otkidach wrote:
> On 10 Feb 2005 11:49:33 -0800
> "Serge Orlov" <[EMAIL PROTECTED]> wrote:
>
> > This thread is about problems only with LANG=C or LANG=POSIX, it's not
> > about other locales. Other locales are working as expected.
>
> You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
> doesn't pass.

I meant "only with C or POSIX locales" when I wrote "only with LANG=C or LANG=POSIX". My bad.

> $LANG doesn't matter if I don't call setlocale.

Sure.

> Fortunately setting any non-C locale solves the problem for all (I
> believe) unicode character:
>
> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> [u'\xb5\xba\xe4\u0430']

I can't find the strict definition of isalpha, but I believe average C program shouldn't care about the current locale alphabet, so isalpha is a union of all supported characters in all alphabets

Serge.
Re: sre is broken in SuSE 9.2
On 10 Feb 2005 11:49:33 -0800
"Serge Orlov" <[EMAIL PROTECTED]> wrote:

> This thread is about problems only with LANG=C or LANG=POSIX, it's not
> about other locales. Other locales are working as expected.

You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py doesn't pass. $LANG doesn't matter if I don't call setlocale. Fortunately, setting any non-C locale solves the problem for all (I believe) unicode characters:

>>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

--
Denis S. Otkidach
http://www.python.ru/ [ru]
Re: sre is broken in SuSE 9.2
Peter Maas wrote:
> Serge Orlov wrote:
> > Denis S. Otkidach wrote:
> > To summarize the discussion: either it's a bug in glibc or there is an
> > option to specify modern POSIX locale. POSIX locale consist of
> > characters from the portable character set, unicode is certainly
> > portable.
>
> What about the environment variable LANG? I have SuSE 9.1 and
> LANG = de_DE.UTF-8. Your example is running well on my computer.

This thread is about problems only with LANG=C or LANG=POSIX, it's not about other locales. Other locales are working as expected.

Serge.
Re: sre is broken in SuSE 9.2
Peter Maas wrote:
>> To summarize the discussion: either it's a bug in glibc or there is an
>> option to specify modern POSIX locale. POSIX locale consist of
>> characters from the portable character set, unicode is certainly
>> portable.
>
> What about the environment variable LANG? I have SuSE 9.1 and
> LANG = de_DE.UTF-8. Your example is running well on my computer.

Python's Unicode subsystem shouldn't depend on the system's LANG setting.
Re: sre is broken in SuSE 9.2
Serge Orlov wrote:
> Denis S. Otkidach wrote:
> To summarize the discussion: either it's a bug in glibc or there is an
> option to specify modern POSIX locale. POSIX locale consist of
> characters from the portable character set, unicode is certainly
> portable.

What about the environment variable LANG? I have SuSE 9.1 and LANG = de_DE.UTF-8. Your example is running well on my computer.

--
---
Peter Maas, M+R Infosysteme, D-52070 Aachen, Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
---
Re: sre is broken in SuSE 9.2
Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>>
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing
> related. What is the reason for broken behavior? Incorrect
> configure options?

To summarize the discussion: either it's a bug in glibc or there is an option to specify a modern POSIX locale. The POSIX locale consists of characters from the portable character set, and unicode is certainly portable.

Serge.
Re: sre is broken in SuSE 9.2
On Thu, 10 Feb 2005 17:46:06 +0100
"Fredrik Lundh" <[EMAIL PROTECTED]> wrote:

> > Can --with-wctype-functions configure option be the
> > source of problem?
>
> yes.
>
> that option disables Python's own Unicode database, and relies on the
> C library's wctype.h (iswalpha, etc) to behave properly for Unicode
> characters. this isn't true for all environments.
>
> is this an official SuSE release? do they often release stuff that
> hasn't been tested at all?

Yes, it's official release:

# rpm -qi python
Name        : python
Relocations : (not relocatable)
Version     : 2.3.4
Vendor      : SUSE LINUX AG, Nuernberg, Germany
Release     : 3
Build Date  : Tue Oct 5 02:28:25 2004
Install date: Fri Jan 28 13:53:49 2005
Build Host  : gambey.suse.de
Group       : Development/Languages/Python
Source RPM  : python-2.3.4-3.src.rpm
Size        : 15108594
License     : Artistic License, Other License(s), see package
Signature   : DSA/SHA1, Tue Oct 5 02:42:38 2004, Key ID a84edae89c800aca
Packager    : http://www.suse.de/feedback
URL         : http://www.python.org/
Summary     : Python Interpreter

BTW, where have they found something with Artistic License in Python?

--
Denis S. Otkidach
http://www.python.ru/ [ru]
Re: sre is broken in SuSE 9.2
Denis S. Otkidach wrote:
> On 10 Feb 2005 03:59:51 -0800
> "Serge Orlov" <[EMAIL PROTECTED]> wrote:
>
> > > On all platforms \w matches all unicode letters when used with flag
> > > re.UNICODE, but this doesn't work on SuSE 9.2:
> [...]
> > I can get the same results on RedHat's python 2.2.3 if I pass re.L
> > option, it looks like this option is implicitly set in Suse.
>
> Looks like you are right:
>
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>> from locale import *
> >>> setlocale(LC_ALL, 'de_DE')
> 'de_DE'
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> <_sre.SRE_Match object at 0x40375560>
>
> But I see nothing related to implicit re.L option in their patches
> and the sources themselves are the same as on other platforms. I'd
> prefer to find the source of problem.

I found that

print u'\xc4'.isalpha()
import locale
print locale.getlocale()

produces different results on Suse (python 2.3.3)

False
(None, None)

and RedHat (python 2.2.3)

1
(None, None)

Serge.
Re: sre is broken in SuSE 9.2
Denis S. Otkidach wrote:
>> > On all platforms \w matches all unicode letters when used with flag
>> > re.UNICODE, but this doesn't work on SuSE 9.2:
>>
>> I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
>> RedHat), check sys.maxunicode.
>>
>> This is not an explanation, but perhaps a hint where to look.
>
> Yes, it uses UCS4. But debian build with UCS4 works fine, so this is
> not a problem. Can --with-wctype-functions configure option be the
> source of problem?

yes.

that option disables Python's own Unicode database, and relies on the C library's wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true for all environments.

is this an official SuSE release? do they often release stuff that hasn't been tested at all?
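To illustrate what "Python's own Unicode database" means for a letter test, here is a small sketch in Python 3 syntax. It is not the interpreter's actual implementation: LETTER_CATEGORIES and is_letter_by_database are hypothetical names, and the rule used (general category is one of Lu, Ll, Lt, Lm, Lo) is the definition of alphabetic given in the Python 3 documentation for str.isalpha. The point is that the answer comes from unicodedata and does not consult the C library or the current locale.

```python
import unicodedata

# General categories that count as "Letter" in the Unicode database.
LETTER_CATEGORIES = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo'}

def is_letter_by_database(ch):
    """Decide letter-ness from Python's bundled Unicode data only."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

# Compare the database answer with str.isalpha for the thread's characters.
for ch in '\xe4\u0430':
    print(repr(ch), is_letter_by_database(ch), ch.isalpha())
```

A build that instead delegates to iswalpha would give whatever the C library's current locale says, which is the difference discussed in this message.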
Re: sre is broken in SuSE 9.2
On Thu, 10 Feb 2005 16:23:09 +0100
Daniel Dittmar <[EMAIL PROTECTED]> wrote:

> Denis S. Otkidach wrote:
>
> > On all platforms \w matches all unicode letters when used with flag
> > re.UNICODE, but this doesn't work on SuSE 9.2:
>
> I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
> RedHat), check sys.maxunicode.
>
> This is not an explanation, but perhaps a hint where to look.

Yes, it uses UCS4. But the debian build with UCS4 works fine, so this is not the problem. Can the --with-wctype-functions configure option be the source of the problem?

--
Denis S. Otkidach
http://www.python.ru/ [ru]
Re: sre is broken in SuSE 9.2
Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:

I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does RedHat), check sys.maxunicode.

This is not an explanation, but perhaps a hint where to look.

Daniel
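The sys.maxunicode check suggested here could look like the sketch below; unicode_build_width is a hypothetical helper name. On the Python 2 builds discussed in this thread the value distinguishes narrow (UCS2) from wide (UCS4) builds, whereas on Python 3.3 and later it is always 0x10FFFF, so there the check only ever reports the wide case.

```python
import sys

def unicode_build_width():
    """Classify the build by sys.maxunicode: 'UCS4' above 0xFFFF, else 'UCS2'."""
    return 'UCS4' if sys.maxunicode > 0xFFFF else 'UCS2'

print(hex(sys.maxunicode), unicode_build_width())
```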
Re: sre is broken in SuSE 9.2
On 10 Feb 2005 03:59:51 -0800
"Serge Orlov" <[EMAIL PROTECTED]> wrote:

> > On all platforms \w matches all unicode letters when used with flag
> > re.UNICODE, but this doesn't work on SuSE 9.2:
[...]
> I can get the same results on RedHat's python 2.2.3 if I pass re.L
> option, it looks like this option is implicitly set in Suse.

Looks like you are right:

>>> import re
>>> re.compile(ur'\w+', re.U).match(u'\xe4')
>>> from locale import *
>>> setlocale(LC_ALL, 'de_DE')
'de_DE'
>>> re.compile(ur'\w+', re.U).match(u'\xe4')
<_sre.SRE_Match object at 0x40375560>

But I see nothing related to an implicit re.L option in their patches, and the sources themselves are the same as on other platforms. I'd prefer to find the source of the problem.

--
Denis S. Otkidach
http://www.python.ru/ [ru]
Re: sre is broken in SuSE 9.2
On Thu, 10 Feb 2005 13:00:42 +0300
"Denis S. Otkidach" <[EMAIL PROTECTED]> wrote:

> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>>
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing
> related. What is the reason for broken behavior? Incorrect configure
> options?

Just a bit more information. test_re.py fails in SuSE 9.2 with the following errors:

Running re_tests test suite
=== Failed incorrectly ('(?u)\\b.\\b', u'\xc4', 0, 'found', u'\xc4')
=== Failed incorrectly ('(?u)\\w', u'\xc4', 0, 'found', u'\xc4')

--
Denis S. Otkidach
http://www.python.ru/ [ru]
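The two failing entries can be tried outside the test suite. This is a sketch in Python 3 syntax, not the re_tests harness itself; CASES and run_cases are made-up names, and only the pattern and subject of each entry are reused.

```python
import re

# The two re_tests entries reported as failing, as (pattern, subject) pairs.
CASES = [
    (r'(?u)\b.\b', '\xc4'),
    (r'(?u)\w', '\xc4'),
]

def run_cases(cases):
    """Return the matched text for each case, or None when nothing is found."""
    results = []
    for pattern, subject in cases:
        m = re.search(pattern, subject)
        results.append(m.group(0) if m else None)
    return results

print(run_cases(CASES))
```

A None in the result would correspond to a "Failed incorrectly" line, since both entries expect u'\xc4' to be found.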
Re: sre is broken in SuSE 9.2
Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>>
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing related.
> What is the reason for broken behavior? Incorrect configure options?

I can get the same results on RedHat's python 2.2.3 if I pass the re.L option; it looks like this option is implicitly set in Suse.

Serge
sre is broken in SuSE 9.2
On all platforms \w matches all unicode letters when used with the flag re.UNICODE, but this doesn't work on SuSE 9.2:

Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
[GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.compile(ur'\w+', re.U).match(u'\xe4')
>>>

BTW, unicodedata correctly recognizes this character as a lowercase letter:

>>> import unicodedata
>>> unicodedata.category(u'\xe4')
'Ll'

I've looked through all the SuSE patches applied, but found nothing related. What is the reason for the broken behavior? Incorrect configure options?

--
Denis S. Otkidach
http://www.python.ru/ [ru]
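For anyone reproducing the report on a current interpreter, here is a minimal sketch in Python 3 syntax (the session above is Python 2.3, where the literals are written ur'...' and u'...'). The helper name check_word_match is hypothetical. It runs both checks from the report side by side: whether \w with re.UNICODE matches the character, and what category the Unicode database assigns to it.

```python
import re
import unicodedata

def check_word_match(ch):
    """Return (matched, category) for a single character.

    matched is True when \\w with re.UNICODE matches ch; category is the
    general category from Python's own Unicode database.
    """
    matched = re.compile(r'\w+', re.UNICODE).match(ch) is not None
    return matched, unicodedata.category(ch)

print(check_word_match('\xe4'))
```

The behavior reported for SuSE 9.2 corresponds to the first element being False while the second is still 'Ll'.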