[issue1609] test_re.py fails

2007-12-20 Thread Ismail Donmez

Ismail Donmez added the comment:

Hi Martin,

Actually, the only problem is how I can get wctype functionality with
8-bit strings; any example is appreciated.

This bug itself is invalid because --with-wctype-functions is
deprecated. But as I said I just hope removing that doesn't regress
Turkish functionality.

__
Tracker [EMAIL PROTECTED]
http://bugs.python.org/issue1609
__
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1609] test_re.py fails

2007-12-20 Thread Ismail Donmez

Ismail Donmez added the comment:

Funnily,

print .encode(iso-8859-9).decode(iso-8859-9).upper()

works, but

print .encode(iso-8859-9).upper().decode(iso-8859-9)

not.




[issue1609] test_re.py fails

2007-12-20 Thread Guido van Rossum

Guido van Rossum added the comment:

> Funnily,
>
> print .encode(iso-8859-9).decode(iso-8859-9).upper()
>
> works, but
>
> print .encode(iso-8859-9).upper().decode(iso-8859-9)
>
> not.

You'll have to debug this yourself.




[issue1609] test_re.py fails

2007-12-20 Thread Ismail Donmez

Ismail Donmez added the comment:

I guess so, I will no longer spam this bug. Thanks for the suggestions.




[issue1609] test_re.py fails

2007-12-20 Thread Guido van Rossum

Guido van Rossum added the comment:

Two easy ways to get the functionality using 8-bit strings, assuming
you've already set your locale properly:

(1) If your data is already an 8-bit string (i.e. isinstance(data,
str)), simply use data.upper() or data.lower().

(2) If your data is Unicode (i.e. isinstance(data, unicode)), convert to
8-bit using encode, apply upper/lower, and convert back to unicode.
E.g. data.encode("Latin-1").upper().decode("Latin-1"). (I don't know
which encoding to use, though, so substitute whatever you have for
Latin-1, but don't use UTF-8.)

PS Martin: the 2.4/2.5 differences were caused by Cartman having hacked
his 2.4 installation to change the default encoding.

--
resolution:  -> invalid
status: open -> closed




[issue1609] test_re.py fails

2007-12-20 Thread Martin v. Löwis

Martin v. Löwis added the comment:

> print .encode(iso-8859-9).upper().decode(iso-8859-9)
> does not

Please get your types right.  is a byte string (in Python 2.x).
encode: unicode -> string
decode: string -> unicode

That you can still apply .encode to the byte string is a bug/pitfall in
Python 2.x, which gets fixed in 3.x (by only supporting .encode on the
unicode type).
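
The type directions above can be illustrated with a minimal sketch. It is written in Python 3 syntax (an assumption; the thread itself concerns Python 2.x), where str plays the role of the 2.x unicode type and bytes the role of the 2.x byte string. The variable names are purely illustrative.

```python
# Python 3 sketch; the thread itself is about Python 2.x.
text = "\u0131"                      # LATIN SMALL LETTER DOTLESS I, a text string
raw = text.encode("iso-8859-9")      # encode: text -> bytes
back = raw.decode("iso-8859-9")      # decode: bytes -> text

print(type(raw).__name__, type(back).__name__)
print(back == text)                  # the round trip preserves the character
print(hasattr(raw, "encode"))        # in 3.x, bytes objects have no .encode
```

Because the byte type no longer offers .encode, the mix-up described above would surface as an AttributeError in 3.x instead of an implicit ASCII decode.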




[issue1609] test_re.py fails

2007-12-20 Thread Ismail Donmez

Ismail Donmez added the comment:

Tried like ,

unicode("iii").encode("iso-8859-9").upper()

doesn't work; I'll ask on the Python users list. Thanks.




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

The Python README says --with-wctype-functions is deprecated and will be
removed in Python 2.6, so I don't think it's worth fixing now. Also, test
failures with --with-wctype-functions seem to be known, according to
Google.

What I wonder is whether removing --with-wctype-functions causes any
regressions under the Turkish locale. I will do some research on that.




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

Indeed, there seem to be regressions:

Python 2.4 :

[~] python
Python 2.4.4 (#1, Oct 23 2007, 11:25:50)
[GCC 3.4.6] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode()

>>> print unicode().upper()

>>> print unicode("i").upper()
İ
>>> print unicode("İ").lower()
i
>>> print unicode("III").lower()
ııı


Python 2.5 (incorrect) :

>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode("i").upper()
I
>>> print unicode().upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> print unicode().upper()



Looks like wctypes should not be dropped.




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

The situation is even more complicated. The following calls behave
_correctly_ when wctypes is enabled:

>>> print unicode("i").upper()
İ
>>> print unicode().lower()


The following don't work even if wctypes is enabled:

>>> print unicode().upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> print unicode("İ").lower()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

All four of these calls work fine in Python 2.4 when wctypes is enabled.




[issue1609] test_re.py fails

2007-12-19 Thread Guido van Rossum

Guido van Rossum added the comment:

Martin, can you have a look at this?

Cartman, can you produce a unittest for the correct behavior that only
uses ASCII input (using \u instead of just typing Turkish characters)?

--
assignee:  -> loewis
nosy: +loewis




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

So in conclusion,

- Enabling wctypes makes Turkish support work with \u syntax, but breaks
unicode()
- Disabling wctypes breaks Turkish support with \u and/or unicode()

The attached test.py tests Turkish corner cases of lower()/upper(). The
correct output is what Python 2.4 gives:

Following should print I
I
Following should print i
i
Following should print İ
İ
Following should print ı
ı

Added file: http://bugs.python.org/file9006/test.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

import locale

locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")

# U+0131 is LATIN SMALL LETTER DOTLESS I
print "Following should print I"
try:
    print u"\u0131".upper()
except UnicodeDecodeError:
    print "Got a unicode decode error"

# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE
print "Following should print i"
try:
    print u"\u0130".lower()
except UnicodeDecodeError:
    print "Got a unicode decode error"

print u"Following should print İ"
try:
    print u"i".upper()
except UnicodeDecodeError:
    print "Got a unicode decode error"

print u"Following should print ı"
try:
    print u"I".lower()
except UnicodeDecodeError:
    print "Got a unicode decode error"



[issue1609] test_re.py fails

2007-12-19 Thread Guido van Rossum

Guido van Rossum added the comment:

Hm.  The test2.py file, when I download it, contains the two bytes
\xc4\xb1 in the first unicode() call, and \xc4\xb0 in the second
one.  This is *always* supposed to produce a UnicodeDecodeError, since
it would use the default encoding which is ASCII.  So I don't understand
how you get this to pass with 2.4 at all.

When you replace the arguments with these hex escapes, does it still
pass for you?  Or does that break it?
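
The point about the default ASCII codec can be reproduced with a small sketch. It uses Python 3 syntax (an assumption; the thread is about Python 2.x, where unicode() applied the default encoding implicitly), and the helper name try_decode is hypothetical.

```python
# Python 3 sketch; try_decode is a hypothetical helper, not part of test2.py.
dotless_i = b"\xc4\xb1"          # UTF-8 bytes of U+0131
dotted_capital_i = b"\xc4\xb0"   # UTF-8 bytes of U+0130

def try_decode(data, encoding):
    """Decode data, or describe the failure instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "UnicodeDecodeError: %s" % exc.reason

# With ASCII, byte 0xc4 cannot be decoded, so this reports an error.
ascii_result = try_decode(dotless_i, "ascii")
# With an explicit UTF-8 codec, the same bytes decode cleanly.
utf8_result = try_decode(dotless_i, "utf-8")

print(ascii_result)
print(repr(utf8_result))
```

Passing the codec explicitly is what makes the decode succeed, regardless of any site-wide default encoding.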




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

Replacing Turkish characters with hex versions in test2.py still results
in UnicodeDecodeError and works with python 2.4.




[issue1609] test_re.py fails

2007-12-19 Thread Guido van Rossum

Guido van Rossum added the comment:

> Replacing Turkish characters with hex versions in test2.py still results
> in UnicodeDecodeError and works with python 2.4.

I'm hoping Martin can confirm this, but I suspect that this is due to
a tightening of the rules for converting from 8-bit strings to unicode
strings.

What happens if you change to unicode(, utf-8)?




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

OK, that was because we had modified the default encoding in Lib/site.py to
be utf-8. Sorry!

The only problem left is that the last 2 conversions in test.py give wrong
results when wctypes is disabled, that is:

print u"\u0069".upper()

should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)

print u"\u0049".lower()

should give \u0131 (LATIN SMALL LETTER DOTLESS I)

These transformations work fine with Python 2.5 when
--with-wctype-functions is used.




[issue1609] test_re.py fails

2007-12-19 Thread Guido van Rossum

Guido van Rossum added the comment:

> print u"\u0069".upper()
>
> should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
>
> print u"\u0049".lower()
>
> should give \u0131 (LATIN SMALL LETTER DOTLESS I)
>
> These transformations work fine with python2.5 when
> --with-wctype-functions is used.

I think that is rather a bug in the wctype functions. Those are ASCII
letters 'i' and 'I' and their upper/lower versions are fixed by the
Unicode standard to be the corresponding ASCII letters ('I' and 'i').
The Unicode case conversions are not affected by locale.
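
A minimal sketch of this locale independence, assuming Python 3 syntax (the thread concerns Python 2.x) and a system where the tr_TR.UTF-8 locale may or may not be installed. The helper name try_turkish_locale is hypothetical.

```python
import locale

def try_turkish_locale():
    """Try to switch to tr_TR.UTF-8; return False if it is not installed."""
    try:
        locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")
    except locale.Error:
        return False
    return True

locale_set = try_turkish_locale()

# Case conversion on text strings follows the Unicode tables,
# so ASCII 'i' and 'I' map to each other whether or not the locale was set.
upper_i = "\u0069".upper()
lower_I = "\u0049".lower()

print(locale_set, repr(upper_i), repr(lower_I))
```

The conversions come out the same whether or not the locale call succeeds, which is the behavior described above.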




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

But it should be affected by locale; that's the point of the
locale.setlocale call. This is how libc's wc functions behave.




[issue1609] test_re.py fails

2007-12-19 Thread Guido van Rossum

Guido van Rossum added the comment:

> But it should be affected by locale, thats the point of locale.setlocale
> call. This is how libc's wc functions behave.

No, the locale should only affect 8-bit string operations, never
unicode operations.




[issue1609] test_re.py fails

2007-12-19 Thread Ismail Donmez

Ismail Donmez added the comment:

OK, then what is the suggested way to get back the Turkish way of doing
upper/lower on i and I?




[issue1609] test_re.py fails

2007-12-19 Thread Guido van Rossum

Guido van Rossum added the comment:

> Ok then what is the suggested way to get back the Turkish way of doing
> upper/lower on  i  I ?

That's a question for Martin von Loewis. I suppose you could use 8-bit
strings exclusively. Or you could use .translate() with a custom dict.
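
The .translate() suggestion could look roughly like the sketch below. It is written in Python 3 syntax (an assumption; in Python 2.x the same idea applies to unicode objects), and the names TR_UPPER, TR_LOWER, tr_upper and tr_lower are hypothetical, not part of any library.

```python
# Hypothetical helpers; Python 3 syntax.
# Map only the two Turkish special cases, then let the regular
# Unicode case conversion handle every other character.
TR_UPPER = {ord("i"): "\u0130", ord("\u0131"): "I"}
TR_LOWER = {ord("I"): "\u0131", ord("\u0130"): "i"}

def tr_upper(text):
    """Uppercase text using the Turkish dotted/dotless i rules."""
    return text.translate(TR_UPPER).upper()

def tr_lower(text):
    """Lowercase text using the Turkish dotted/dotless i rules."""
    return text.translate(TR_LOWER).lower()

print(tr_upper("istanbul"))
print(tr_lower("ISPARTA"))
```

Translating before calling upper()/lower() keeps the special cases out of the generic conversion, so the result does not depend on the process locale.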




[issue1609] test_re.py fails

2007-12-19 Thread Martin v. Löwis

Martin v. Löwis added the comment:

I think too many issues get mixed in this report. I would like to ignore
all but one issue, but I don't understand what the one issue is that
this report should deal with.

cartman, when you compare Python 2.4 and 2.5, could it be that the 2.4
Python was compiled --with-wctype-functions, and the 2.5 Python
--without-wctype-functions? That would surely explain the difference.

The Unicode lower/upper implementations are, by default, locale-unaware.
That is correct behavior, and by design. If you want locale-dependent
behavior, use 8-bit strings as Guido says.

ISTM that the original report was resolved - the tests don't support
--with-wctype-functions. This is because they assume that they know that 
LATIN CAPITAL LETTER A WITH DIAERESIS is a letter - which may not be the
case if the isletter test is locale-specific. If this is to be fixed,
the proper fix would be to just remove the test, which I advise against
- instead, the best behavior that Python should implement is the current
one, i.e. it is a good thing that the test fails
--with-wctype-functions. Perhaps a comment should be attached explaining
the potential breakage.
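
The assumption baked into the failing re_tests entries can be sketched as follows. This is Python 3 syntax (an assumption; there, str patterns use Unicode classification by default, much like the (?u) flag in the 2.x tests), with re.ASCII standing in for a classification under which the character is not a letter.

```python
import re

a_diaeresis = "\xc4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

# Unicode classification: the character counts as a word character,
# so both patterns from the failing tests find it.
unicode_word = re.search(r"\w", a_diaeresis)
unicode_boundary = re.search(r"\b.\b", a_diaeresis)

# ASCII-only classification: the same character is not a word character.
ascii_word = re.search(r"\w", a_diaeresis, re.ASCII)

print(unicode_word is not None, unicode_boundary is not None, ascii_word is None)
```

Whether \w matches depends entirely on which classification the regex engine consults, which is why a locale-specific letter test can change the test outcome.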




[issue1609] test_re.py fails

2007-12-17 Thread Guido van Rossum

Guido van Rossum added the comment:

Focus on how using --with-wctype-functions changes things and how this
could affect the regex implementation. (I wouldn't be surprised if the
other failing tests were due to the regex bugs.)




[issue1609] test_re.py fails

2007-12-14 Thread Ismail Donmez

Ismail Donmez added the comment:

Any ideas/comments on how to move forward with this?

Thanks,
ismail




[issue1609] test_re.py fails

2007-12-13 Thread Ismail Donmez

New submission from Ismail Donmez:

Using python 2.5 revision 59479 from release25-maint branch, 

[~/python-2.5] LD_LIBRARY_PATH=/home/cartman/python-2.5: ./python
./Lib/test/test_re.py
test_anyall (__main__.ReTests) ... ok
test_basic_re_sub (__main__.ReTests) ... ok
test_bigcharset (__main__.ReTests) ... ok
test_bug_113254 (__main__.ReTests) ... ok
test_bug_1140 (__main__.ReTests) ... ok
test_bug_114660 (__main__.ReTests) ... ok
test_bug_117612 (__main__.ReTests) ... ok
test_bug_418626 (__main__.ReTests) ... ok
test_bug_448951 (__main__.ReTests) ... ok
test_bug_449000 (__main__.ReTests) ... ok
test_bug_449964 (__main__.ReTests) ... ok
test_bug_462270 (__main__.ReTests) ... ok
test_bug_527371 (__main__.ReTests) ... ok
test_bug_545855 (__main__.ReTests) ... ok
test_bug_581080 (__main__.ReTests) ... ok
test_bug_612074 (__main__.ReTests) ... ok
test_bug_725106 (__main__.ReTests) ... ok
test_bug_725149 (__main__.ReTests) ... ok
test_bug_764548 (__main__.ReTests) ... ok
test_bug_817234 (__main__.ReTests) ... ok
test_bug_926075 (__main__.ReTests) ... ok
test_bug_931848 (__main__.ReTests) ... ok
test_category (__main__.ReTests) ... ok
test_constants (__main__.ReTests) ... ok
test_empty_array (__main__.ReTests) ... ok
test_expand (__main__.ReTests) ... ok
test_finditer (__main__.ReTests) ... ok
test_flags (__main__.ReTests) ... ok
test_getattr (__main__.ReTests) ... ok
test_getlower (__main__.ReTests) ... ok
test_groupdict (__main__.ReTests) ... ok
test_ignore_case (__main__.ReTests) ... ok
test_non_consuming (__main__.ReTests) ... ok
test_not_literal (__main__.ReTests) ... ok
test_pickling (__main__.ReTests) ... ok
test_qualified_re_split (__main__.ReTests) ... ok
test_qualified_re_sub (__main__.ReTests) ... ok
test_re_escape (__main__.ReTests) ... ok
test_re_findall (__main__.ReTests) ... ok
test_re_groupref (__main__.ReTests) ... ok
test_re_groupref_exists (__main__.ReTests) ... ok
test_re_match (__main__.ReTests) ... ok
test_re_split (__main__.ReTests) ... ok
test_re_subn (__main__.ReTests) ... ok
test_repeat_minmax (__main__.ReTests) ... ok
test_scanner (__main__.ReTests) ... ok
test_search_coverage (__main__.ReTests) ... ok
test_search_star_plus (__main__.ReTests) ... ok
test_special_escapes (__main__.ReTests) ... ok
test_sre_character_class_literals (__main__.ReTests) ... ok
test_sre_character_literals (__main__.ReTests) ... ok
test_stack_overflow (__main__.ReTests) ... ok
test_sub_template_numeric_escape (__main__.ReTests) ... ok
test_symbolic_refs (__main__.ReTests) ... ok
test_weakref (__main__.ReTests) ... ok

--
Ran 55 tests in 0.194s

OK
Running re_tests test suite
=== Failed incorrectly ('(?u)\\b.\\b', u'\xc4', 0, 'found', u'\xc4')
=== Failed incorrectly ('(?u)\\w', u'\xc4', 0, 'found', u'\xc4')

--
components: Tests
messages: 58527
nosy: cartman
severity: normal
status: open
title: test_re.py fails
versions: Python 2.5




[issue1609] test_re.py fails

2007-12-13 Thread Guido van Rossum

Guido van Rossum added the comment:

Can't reproduce.

Like before, what platform, compiler etc.?  Does using ./configure
--with-pydebug make a difference?  What's the LD_LIBRARY_PATH for?

--
nosy: +gvanrossum




[issue1609] test_re.py fails

2007-12-13 Thread Ismail Donmez

Ismail Donmez added the comment:

gcc 4.3, Linux 2.6.18, 32bit. 

Without LD_LIBRARY_PATH it would use the system libraries and not the
compiled ones, which is not wanted anyway.

The configure line used is (damn, I forgot to specify this before, sorry):

--with-fpectl \
--enable-shared \
--enable-ipv6 \
--with-threads \
--enable-unicode=ucs4 \
--with-wctype-functions

--with-pydebug doesn't help.




[issue1609] test_re.py fails

2007-12-13 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc added the comment:

>> Is GCC 4.3 released yet?
>
> Not yet but soon, its less buggy compared to 4.1 and 4.2
> at the moment.

Not quite yet, gcc 4.3 had a big inlining bug that was just corrected
two weeks ago:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434
You may have encountered this bug, or another similar one...

--
nosy: +amaury.forgeotdarc




[issue1609] test_re.py fails

2007-12-13 Thread Ismail Donmez

Ismail Donmez added the comment:

> Not quite yet, gcc 4.3 had a big inlining bug that was just corrected
> two weeks ago:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434
> You may have encountered this bug, or another similar one...

Two weeks ago is too old for me, I am using SVN snapshot from yesterday :-)




[issue1609] test_re.py fails

2007-12-13 Thread Ismail Donmez

Ismail Donmez added the comment:

In total, removing --with-wctype-functions fixes the following regression tests:

test_codecs 
test_re 
test_ucn 
test_unicodedata




[issue1609] test_re.py fails

2007-12-13 Thread Ismail Donmez

Ismail Donmez added the comment:

Remove test_ucn from the list; it still fails, but that's for another bug
report.
