[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread pbnan

New submission from pbnan pi...@banaszkiewicz.org:

Python:
Python 2.7 (r27:82500, Oct 20 2010, 03:21:03) 
[GCC 4.5.1] on linux2

Code:
>>> c = u'\u200b'
>>> c.isspace()
False

In both 2.6 and 3.1 it works.

http://www.cs.tut.fi/~jkorpela/chars/spaces.html

--
components: Unicode
messages: 122690
nosy: pbnan
priority: normal
severity: normal
status: open
title: Unicode space character \u200b unrecognised a space
type: behavior
versions: Python 2.7

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10567
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread SilentGhost

SilentGhost michael.mischurow+...@gmail.com added the comment:

It returns False on the latest py3k checkout as well.

--
nosy: +SilentGhost




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread SilentGhost

Changes by SilentGhost michael.mischurow+...@gmail.com:


--
versions: +Python 3.2




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


--
nosy: +belopolsky




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

The category of U+200B was changed in Unicode 4.0.1:


The main new features in Unicode 4.0.1 are the following:
 ...
 * Changed: general category of U+200B ZERO WIDTH SPACE
 http://unicode.org/versions/Unicode4.0.1/
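
The reclassification can be observed on a current interpreter. The following is a minimal sketch for Python 3, assuming the standard `unicodedata` module, whose `ucd_3_2_0` object exposes a frozen Unicode 3.2.0 snapshot alongside the current database. The variable names are illustrative only.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE

# General category in the database bundled with this interpreter.
current = unicodedata.category(ZWSP)

# General category in the frozen Unicode 3.2.0 snapshot.
legacy = unicodedata.ucd_3_2_0.category(ZWSP)

print(unicodedata.name(ZWSP), "3.2.0:", legacy, "current:", current)
print("isspace():", ZWSP.isspace())
```

If the two categories differ, the character was reclassified somewhere between Unicode 3.2.0 and the version the interpreter was built with.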

--
resolution:  -> invalid
status: open -> pending




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

In 2.6, there was a manually maintained list, probably dating back to before 
Unicode 4.0. Python uses the following criterion for determining white space 
characters:

/* Returns 1 for Unicode characters having the bidirectional type
   'WS', 'B' or 'S' or the category 'Zs', 0 otherwise. */

Since r75272, this is generated from the current Unicode database, and should 
thus always be correct.

Unless you can somehow prove that the criterion should be changed, or that 
Python computes it incorrectly, I'm closing this report as invalid.
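
The quoted criterion can be restated in Python for experimentation. This is only a sketch built on the `unicodedata` module, not the C implementation; the function name `is_space_by_criterion` and the sample list are made up for illustration.

```python
import unicodedata

def is_space_by_criterion(ch):
    """Apply the quoted rule: bidi type WS, B or S, or category Zs."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")

# A few characters to compare against the built-in str.isspace().
samples = [" ", "\t", "\n", "\u00a0", "\u2003", "\u200b", "a"]
for ch in samples:
    print("U+%04X" % ord(ch), is_space_by_criterion(ch), ch.isspace())
```

For these samples the two columns should agree; a mismatch would suggest that the criterion or the generated data had changed.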

--
nosy: +loewis
status: pending -> closed




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread SilentGhost

SilentGhost michael.mischurow+...@gmail.com added the comment:

It's not just this character. isspace() is also False for \u200c and \u200d 
(from the same category), and for \u2060, \u2800 and \ufeff.
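
To see how the database bundled with a given interpreter classifies the characters mentioned, one can print their properties. A minimal sketch for Python 3; the helper name `describe` is hypothetical.

```python
import unicodedata

# Code points mentioned in this thread.
code_points = [0x200B, 0x200C, 0x200D, 0x2060, 0x2800, 0xFEFF]

def describe(cp):
    """Return (name, general category, bidi type, isspace) for a code point."""
    ch = chr(cp)
    return (unicodedata.name(ch), unicodedata.category(ch),
            unicodedata.bidirectional(ch), ch.isspace())

for cp in code_points:
    print("U+%04X" % cp, *describe(cp))
```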

--




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> It's not just this character. isspace() is also False for \u200c and \u200d 
> (from the same category), and for \u2060, \u2800 and \ufeff.

What reason do you have to believe that they should be classified as
whitespace, other than the web page you are quoting (which is apparently
out of date) and the fact that Python 2.6 was classifying them this way
(which is also out of date, for the very same reason)?

--
title: Some unicode space characters are not recognized as a space -> Unicode 
space character \u200b unrecognised a space




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread SilentGhost

SilentGhost michael.mischurow+...@gmail.com added the comment:

I'm not quoting anything. Thank you very much.

--




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
> 
> Martin v. Löwis mar...@v.loewis.de added the comment:
> 
> In 2.6, there was a manually maintained list, probably dating back to before 
> Unicode 4.0. 

That's not quite correct: Python 1.6.x - 2.5.x used tables for the
PyUnicode_ISSPACE() function that were created from the Unicode database.
Python 2.6.x introduced a short-cut table for ASCII whitespace, but still
fell back to the generated tables for non-ASCII code points.

The tables were never manually maintained, but we also did not update
Python for each new Unicode version:

Python 1.6: Unicode 3.0
Python 2.0: Unicode 3.0
Python 2.1: Unicode 3.0
Python 2.2: Unicode 3.0
Python 2.3: Unicode 3.2
Python 2.4: Unicode 3.2
Python 2.5: Unicode 4.1
Python 2.6: Unicode 5.1
Python 2.7: Unicode 5.2

> Python uses the following criterion for determining white space characters:
>
> /* Returns 1 for Unicode characters having the bidirectional type
>    'WS', 'B' or 'S' or the category 'Zs', 0 otherwise. */

This definition has been used since Python 1.6.x.

--
nosy: +lemburg




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread SilentGhost

Changes by SilentGhost michael.mischurow+...@gmail.com:


--
nosy:  -SilentGhost




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> I'm not quoting anything. Thank you very much.

Oops, sorry - I confused you with the OP.

--




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sun, Nov 28, 2010 at 2:07 PM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> The tables were never manually maintained, but we also did not update
> Python for each new Unicode version:
>
> Python 1.6: Unicode 3.0
> Python 2.0: Unicode 3.0
> Python 2.1: Unicode 3.0
> Python 2.2: Unicode 3.0
> Python 2.3: Unicode 3.2
> Python 2.4: Unicode 3.2
> Python 2.5: Unicode 4.1
> Python 2.6: Unicode 5.1
> Python 2.7: Unicode 5.2


Thank you for the summary.  Note that the Python reference pages have been
updated even less frequently. [1]  Since the Python language and standard
library definitions are now (in 3.x) closely tied to the Unicode
definition, I wonder whether unicodedata.unidata_version should be
more prominently featured in the docs.  (Possibly even included in the
Python CLI banner, but that is probably overkill.)

[1] http://mail.python.org/pipermail/docs/2010-November/002074.html
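
For reference, the attribute mentioned above can be queried at run time. A minimal sketch that prints the interpreter version next to the Unicode database version it was built with:

```python
import sys
import unicodedata

# Interpreter version as major.minor.micro.
print("Python %d.%d.%d" % sys.version_info[:3])

# Version of the Unicode Character Database compiled into unicodedata.
print("Unicode", unicodedata.unidata_version)
```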

--




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

>> In 2.6, there was a manually maintained list, probably dating back to before 
>> Unicode 4.0. 
> 
> That's not quite correct: Python 1.6.x - 2.5.x used tables for the
> PyUnicode_ISSPACE() function that were created from the Unicode database.

That used to be the case until r39757, when you made this change:


r39757 | lemburg | 2005-10-20 21:06:35 +0200 (Thu, 20 Oct 2005) | 7 lines
Changed paths:
   M /python/trunk/Objects/unicodectype.c

Enhance the performance of two important Unicode character
type lookups: whitespace and linebreak.

These lookup tables are from the Python 1.6 version with the addition
of the 205F code point which was added as whitespace code point to
Unicode since then.



In 2.5 and 2.6, there was no table lookup anymore, but a switch
statement. Not sure how you arrived at the code; the commit message
doesn't say (but the wording suggests it was manually computed).
It was not updated in 2.6.

--




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

It is still strange that the .isspace() property value changed,
since the code point has not changed in the recent Unicode versions:

4.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
5.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
5.2.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
6.0.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;

based on http://www.unicode.org/Public/version/ucd/UnicodeData.txt

True
$ python2.5 -c 'print u"\u200b".isspace()'
True
$ python2.6 -c 'print u"\u200b".isspace()'
True
$ python2.7 -c 'print u"\u200b".isspace()'
False

Looking at the code again: Now I know why...

The tables in unicodectype.c were generated from the Unicode database,
but not by the makeunicodedata.py script. I used a script to generate
those tables for Python 1.6.0 and it seems that they were never updated
since then. Python 2.7 then replaced them with the data from the
makeunicodedata.py script.

That's probably why Martin thought they were manually maintained.

--




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Going back further shows the change:

3.0.1: 200B;ZERO WIDTH SPACE;Zs;0;BN;N;
3.2.0: 200B;ZERO WIDTH SPACE;Zs;0;BN;N;
4.0.1: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
4.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
5.1.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
5.2.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
6.0.0: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;
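
The comparison above can also be done mechanically. The sketch below takes the abbreviated records exactly as listed in this message (not full UnicodeData.txt lines) and reports where the general category field changes; the helper name `general_category` is made up for illustration.

```python
# Abbreviated UnicodeData.txt records for U+200B, as listed above.
records = [
    ("3.0.1", "200B;ZERO WIDTH SPACE;Zs;0;BN;N;"),
    ("3.2.0", "200B;ZERO WIDTH SPACE;Zs;0;BN;N;"),
    ("4.0.1", "200B;ZERO WIDTH SPACE;Cf;0;BN;N;"),
    ("4.1.0", "200B;ZERO WIDTH SPACE;Cf;0;BN;N;"),
    ("5.1.0", "200B;ZERO WIDTH SPACE;Cf;0;BN;N;"),
    ("5.2.0", "200B;ZERO WIDTH SPACE;Cf;0;BN;N;"),
    ("6.0.0", "200B;ZERO WIDTH SPACE;Cf;0;BN;N;"),
]

def general_category(record):
    # Field 0 is the code point, field 1 the name, field 2 the category.
    return record.split(";")[2]

# Collect every pair of adjacent versions whose category differs.
changes = []
for (old_ver, old_rec), (new_ver, new_rec) in zip(records, records[1:]):
    old_cat, new_cat = general_category(old_rec), general_category(new_rec)
    if old_cat != new_cat:
        changes.append((old_ver, new_ver, old_cat, new_cat))

for old_ver, new_ver, old_cat, new_cat in changes:
    print("%s -> %s: %s became %s" % (old_ver, new_ver, old_cat, new_cat))
```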

Interesting that no one noticed in all these years.

--




[issue10567] Unicode space character \u200b unrecognised a space

2010-11-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sun, Nov 28, 2010 at 2:40 PM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> Going back further shows the change:
>
> 3.0.1: 200B;ZERO WIDTH SPACE;Zs;0;BN;N;
> 3.2.0: 200B;ZERO WIDTH SPACE;Zs;0;BN;N;
> 4.0.1: 200B;ZERO WIDTH SPACE;Cf;0;BN;N;

Yes, see msg122694 above.

--
