[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-22 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> I've discussed this once more.
>
> From islower man page:
>
> RETURN VALUES
>  If the argument to any of the character handling  macros  is
>  not  in the domain of the function, the result is undefined.

This is not the wording of the POSIX spec:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/islower.html

"The c argument is an int, the value of which the application shall
ensure is a character representable as an unsigned char or equal to the
value of the macro EOF."

This means that any value between 0 and 255 (representable as an
unsigned char) is a valid input for islower().

This would mean illumos deviates from the POSIX spec here. I would
suggest fixing your libc's ctype.h implementation, patching your
version of Python to work around this issue, or both.

Note the ISO C99 standard has the same wording as POSIX:

"The header <ctype.h> declares several functions useful for
classifying and mapping characters. In all cases the argument is an int,
the value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF."

(Note also that under Linux and most likely other Unices,
string.lowercase and string.uppercase work fine under a UTF-8 locale)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue20049
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-22 Thread Antoine Pitrou

Antoine Pitrou added the comment:

To elaborate a bit further, I agree with the following statement in the
aforementioned [illumos-devel] discussion thread:

"In further explanation, the isalpha() and friends *should* probably return
false for the value 196, or any other byte with high order bit set, in UTF-8
locales."
http://thread.gmane.org/gmane.os.illumos.devel/14193/focus=14206
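
As an illustration of why the value 196 is special (a Python 3 sketch, not part of the original discussion): the code point U+00C4 is the single byte 196 in ISO-8859-1, but in UTF-8 the same letter is a two-byte sequence, so a lone byte 196 is only the start of a character there.

```python
# Python 3 sketch: how the letter with code point 196 is encoded.
letter = chr(196)  # U+00C4, LATIN CAPITAL LETTER A WITH DIAERESIS

latin1_bytes = letter.encode("latin-1")  # one byte in ISO-8859-1
utf8_bytes = letter.encode("utf-8")      # two bytes in UTF-8

print(list(latin1_bytes))  # [196]
print(list(utf8_bytes))    # [195, 132]
```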

I'll also point out that the code examples in the POSIX spec use islower()
exactly like Python does, on arbitrary integers between 0 and 255:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/islower.html

    c = (unsigned char) (rand() % 256);
    ...
        if (islower(c))
            keystr[len++] = c;
    }
    ...




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-22 Thread Antoine Pitrou

Antoine Pitrou added the comment:

As to whether we will add a workaround for this in Python:

- Python follows POSIX correctly here, and no issue was reported in mainstream 
OSes such as Linux, OS X or the *BSDs

- this only exists in 2.7, which is in extended maintenance mode (it's the last 
of the 2.x series, and will probably stop being maintained in a few years); 
Python 3.x doesn't have this issue

- IllumOS is a rather niche OS that none of us is using, so adding a 
system-specific workaround doesn't sound very compelling

Thanks for reporting, though. It's good to be reminded that locales and ctype.h 
are a rather lousy design :-)




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-22 Thread Stefan Krah

Stefan Krah added the comment:

Alexander, the domain of the function probably refers to
the range [-1, 255].

C99:


1 The header <ctype.h> declares several functions useful for classifying
  and mapping characters. In all cases the argument is an int, the value
  of which shall be representable as an unsigned char or shall equal the
  value of the macro EOF. If the argument has any other value, the
  behavior is undefined.
2 The behavior of these functions is affected by the current locale.
  Those functions that have locale-specific aspects only when not in the
  "C" locale are noted below.
3 The term printing character refers to a member of a locale-specific
  set of characters, each of which occupies one printing position on a
  display device; the term control character refers to a member of a
  locale-specific set of characters that are not printing characters.
  All letters and digits are printing characters.
Forward references: EOF (7.19.1), localization (7.11).

7.4.1 Character classification functions
1 The functions in this subclause return nonzero (true) if and only if
  the value of the argument c conforms to that in the description of the
  function.


I think this agrees with what Antoine has said.
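
The domain described in the C99 text can be written down as a small check. This is only a sketch in Python: the helper name in_ctype_domain is hypothetical, and it assumes EOF is -1 and that unsigned char is 8 bits wide, which is typical but not guaranteed by the standard.

```python
# Hypothetical helper mirroring the C99 wording.
# Assumptions: EOF == -1 and UCHAR_MAX == 255 (8-bit unsigned char).
EOF = -1
UCHAR_MAX = 255

def in_ctype_domain(c):
    """Return True if c is a defined argument for islower() and friends."""
    return c == EOF or 0 <= c <= UCHAR_MAX

print(in_ctype_domain(196))  # True: representable as an unsigned char
print(in_ctype_domain(-1))   # True: equals the assumed EOF
print(in_ctype_domain(256))  # False: outside the unsigned char range
print(in_ctype_domain(-60))  # False: what 0xC4 becomes in a signed char
```

Under these assumptions every value in the loop from 0 to 255 is inside the domain, so the undefined-behavior clause would not apply to it.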

--
nosy: +skrah




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-22 Thread Stefan Krah

Stefan Krah added the comment:

IOW, I also support closing this issue. :)




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-22 Thread R. David Murray

R. David Murray added the comment:

Yes, I definitely think this falls into the category of platform bugs, and we 
only maintain workarounds for those for mainstream OSes.  Others need to 
maintain their own local patches, just as for any other changes that are 
required to get Python working on those platforms.  (A platform's status can 
change over time, of course, but this is the category illumos currently falls 
into.)

--
resolution:  -> rejected
stage:  -> committed/rejected
status: open -> closed




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-21 Thread Alexander Pyhalov

New submission from Alexander Pyhalov:

When Python 2.6 (or 2.7) is compiled with _XOPEN_SOURCE=600 on illumos, 
string.lowercase and string.uppercase contain garbage when a UTF-8 locale is 
used (OpenIndiana bug report: https://www.illumos.org/issues/4411 ).
The reason is that with a UTF-8 locale, islower()/isupper() and similar 
functions are not expected to work with non-ASCII symbols. 
So code like 

    n = 0;
    for (c = 0; c < 256; c++) {
        if (islower(c))
            buf[n++] = c;
    }

is expected to fail, because it calls islower on illegal UTF-8 symbols (with 
codes 128-255). It should be converted to something like

    n = 0;
    for (c = 0; c < 256; c++) {
        if (isascii(c) && islower(c))
            buf[n++] = c;
    }

or to 

    n = 0;
    for (c = 0; c < 128; c++) {
        if (islower(c))
            buf[n++] = c;
    }

Before doing this you should check whether the locale is UTF-8. However, 
almost all non-C locales on illumos are UTF-8. 
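
The effect of the proposed guard can be sketched in Python 3. The names collect, is_lower and is_ascii_lower are hypothetical, and str.islower() on chr(c) is only a stand-in for a locale-aware C islower(); it is not what Python 2 actually calls.

```python
# Python 3 sketch of the unguarded and the ASCII-guarded loops.
def collect(predicate, limit=256):
    """Collect every c in range(limit) accepted by predicate, as a string."""
    return "".join(chr(c) for c in range(limit) if predicate(c))

def is_lower(c):
    # Stand-in for islower(): treats c as a Unicode code point.
    return chr(c).islower()

def is_ascii_lower(c):
    # Counterpart of isascii(c) && islower(c).
    return c < 128 and is_lower(c)

unguarded = collect(is_lower)
guarded = collect(is_ascii_lower)

print(guarded)                        # abcdefghijklmnopqrstuvwxyz
print(len(unguarded) > len(guarded))  # True: extra entries above 127
```

Stopping the loop at 128, as in the second variant, gives the same string as the guarded predicate in this sketch.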


Example of incorrect behavior: 

Python 2.6.9 (unknown, Nov 12 2013, 13:54:48) 
[GCC 4.7.3] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> string.lowercase
'abcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde'


--
components: Unicode
messages: 206786
nosy: Alexander.Pyhalov, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: string.lowercase and string.uppercase can contain garbage
type: behavior
versions: Python 2.7




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-21 Thread R. David Murray

R. David Murray added the comment:

In Python 2, string.lowercase and string.uppercase are locale dependent.  This 
isn't really all that useful in practice, which is why it was dropped in 
Python 3.  The proposed fix might be correct, *if* UTF-8 is checked for (see, 
e.g., Issue 6525), but do you have any idea why this is a problem on illumos 
with _XOPEN_SOURCE=600 but not on any other platform (as far as we know)?  It 
seems like it would be a bug in the platform's islower and isupper functions, 
which are supposed to operate on integers that fit in an unsigned char, and be 
locale aware, according to the standards.
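
For comparison, a short Python 3 check of what remains in the string module there: only the fixed ASCII constants, which do not depend on the locale.

```python
import string

# Python 3 keeps only the locale-independent ASCII constants.
print(string.ascii_lowercase)        # abcdefghijklmnopqrstuvwxyz
print(string.ascii_uppercase)        # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(hasattr(string, "lowercase"))  # False on Python 3
```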

--
nosy: +r.david.murray




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-21 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> The reason is that with UTF-8 locale islower()/isupper() and similar
> functions are not expected to work with non-ascii symbols.

Can you explain why?

--
nosy: +pitrou




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-21 Thread Alexander Pyhalov

Alexander Pyhalov added the comment:

Honestly, I don't understand locale-related things well enough. But I
received this explanation when I discussed a similar issue on the illumos
developers mailing list:
http://comments.gmane.org/gmane.os.illumos.devel/14193




[issue20049] string.lowercase and string.uppercase can contain garbage

2013-12-21 Thread Alexander Pyhalov

Alexander Pyhalov added the comment:

I've discussed this once more. 

From islower man page:

RETURN VALUES
 If the argument to any of the character handling  macros  is
 not  in the domain of the function, the result is undefined.

And (char) values 128-255 are not legal UTF-8 (at least from what I see on 
Wikipedia: http://en.wikipedia.org/wiki/UTF-8 ).
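
The claim about single bytes can be checked with a small Python 3 sketch; the helper name decodes_alone is hypothetical. It only shows that such bytes are not complete UTF-8 sequences by themselves. Whether islower() may be called on those values is a separate question, which the C standard answers in terms of unsigned char values.

```python
def decodes_alone(byte_value):
    """Return True if the single byte is valid UTF-8 by itself."""
    try:
        bytes([byte_value]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ascii_ok = all(decodes_alone(b) for b in range(128))
high_ok = any(decodes_alone(b) for b in range(128, 256))

print(ascii_ok)  # True: every byte 0-127 is a complete character
print(high_ok)   # False: no byte 128-255 is valid on its own
```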
