[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-30 Thread Georg Brandl

Georg Brandl added the comment:

I think you will, Matthew being MRAB on the mailing lists :)

--
nosy: +georg.brandl

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16688
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-29 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
assignee:  -> serhiy.storchaka




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-29 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 44a4f9289faa by Serhiy Storchaka in branch '3.3':
Issue #16688: Fix backreferences did make case-insensitive regex fail on 
non-ASCII strings.
http://hg.python.org/cpython/rev/44a4f9289faa

New changeset c59ee1ff6f27 by Serhiy Storchaka in branch 'default':
Issue #16688: Fix backreferences did make case-insensitive regex fail on 
non-ASCII strings.
http://hg.python.org/cpython/rev/c59ee1ff6f27

--
nosy: +python-dev




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-29 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Fixed. Thank you for the patch, Matthew. I hope to see more of your patches.

--
resolution:  -> fixed
stage: commit review -> committed/rejected
status: open -> closed




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The patches LGTM. How about adding a test?

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett

Matthew Barnett added the comment:

Here are some tests for the issue.

--
Added file: http://bugs.python.org/file28330/issue16688#3.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The second test passes on unpatched Python.

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett

Matthew Barnett added the comment:

Oops! :-( Now corrected.

--
Added file: http://bugs.python.org/file28332/issue16688#3.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Matthew Barnett

Changes by Matthew Barnett pyt...@mrabarnett.plus.com:


Removed file: http://bugs.python.org/file28330/issue16688#3.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

LGTM.

Matthew, can you please submit a contributor form?

http://python.org/psf/contrib/contrib-form/
http://python.org/psf/contrib/

--
stage: patch review -> commit review




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Good analysis, Matthew. Would you like to submit a patch?

--
keywords: +easy
stage:  -> needs patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett

Matthew Barnett added the comment:

OK, here's a patch.

--
keywords: +patch
Added file: http://bugs.python.org/file28321/issue16688.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
stage: needs patch -> patch review




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread STINNER Victor

STINNER Victor added the comment:

Can someone check whether there are any other similar regressions (introduced
by PEP 393)?

2012/12/15 Serhiy Storchaka rep...@bugs.python.org:
>
> Changes by Serhiy Storchaka storch...@gmail.com:
>
>
> --
> stage: needs patch -> patch review
>
> ___
> Python tracker rep...@bugs.python.org
> http://bugs.python.org/issue16688
> ___

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett

Matthew Barnett added the comment:

I found another bug while looking through the source.

On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;

where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the 
length in _bytes_, not characters.

If the byte after the end of the string is 0 then you get this:

>>> # Good:
>>> re.search(r"\x00{1,3}", "a\x00\x00").span()
(1, 3)
>>> # Bad:
>>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
(1, 4)

I'll keep looking before submitting a patch.
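
To make the arithmetic concrete, here is a minimal Python sketch of the end-pointer clamp described above. It is a model, not the C source: `clamp_end` and its `buggy` switch are hypothetical names, offsets are counted in bytes, and the "fixed" branch (dividing the remaining length by the character size) is an assumed correction rather than a quote from the committed patch.

```python
def clamp_end(ptr, end, maxcount, charsize, buggy):
    """Return the byte offset at which a repeat count must stop scanning.

    ptr and end are byte offsets into the subject buffer, maxcount is a
    number of characters, charsize is 1, 2 or 4 bytes per character.
    """
    remaining = end - ptr            # length in bytes
    if not buggy:
        remaining //= charsize       # assumed fix: compare in characters
    if maxcount < remaining and maxcount != 65535:
        end = ptr + maxcount * charsize
    return end


# "a\x00\x00": 1 byte per character, scanning starts at byte 1 of 3.
print(clamp_end(1, 3, 3, 1, buggy=True))    # 3, stays inside the buffer
# "\u0100\x00\x00": 2 bytes per character, scanning starts at byte 2 of 6.
print(clamp_end(2, 6, 3, 2, buggy=True))    # 8, two bytes past the end
print(clamp_end(2, 6, 3, 2, buggy=False))   # 6, stays inside the buffer
```

With the byte-based comparison the limit moves one character past the end of the 2-byte string, which is consistent with the extra character in the `(1, 4)` span when the trailing bytes happen to be zero.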

--




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-15 Thread Matthew Barnett

Matthew Barnett added the comment:

I haven't found any other issues, so here's the second patch.

--
Added file: http://bugs.python.org/file28325/issue16688#2.patch




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread pyos

New submission from pyos:

The title says it all: if a regular expression that makes use of backreferences 
is compiled with the `re.I` flag, it will always fail when matched against a string 
that contains characters outside of the U+0000-U+00FF range. I've been unable to 
narrow the bug down further.

A simple example:

>>> import re
>>> r = re.compile(r'(a)\1', re.I)  # should match "aa", "aA", "Aa", or "AA"
>>> r.findall('aa')  # works as expected
['a']
>>> r.findall('aa bcd')  # still works
['a']
>>> r.findall('aa Ā')  # ord('Ā') == 0x0100
[]

The same code works as expected in Python 3.2:

>>> r.findall('aa Ā')
['a']
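
The report can be turned into a small self-check along these lines. This is only a sketch: `backref_ignorecase_ok` is a hypothetical helper name, and the 4-byte test character is an added assumption that goes beyond the single `'Ā'` case in the report.

```python
import re


def backref_ignorecase_ok(tail):
    """Check that (a)\\1 with re.I finds every case variant before `tail`."""
    pattern = re.compile(r'(a)\1', re.I)
    # findall() returns the text captured by group 1, i.e. the first letter.
    return all(
        pattern.findall(pair + ' ' + tail) == [pair[0]]
        for pair in ('aa', 'aA', 'Aa', 'AA')
    )


for tail in ('bcd', '\u0100', '\U0001F600'):
    print(repr(tail), backref_ignorecase_ok(tail))
```

On an affected interpreter the non-Latin-1 tails would be expected to report `False`; on one where the bug is fixed, all three lines should report `True`.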

--
components: Regular Expressions
messages: 177518
nosy: ezio.melotti, mrabarnett, pitrou, pyos
priority: normal
severity: normal
status: open
title: Backreferences make case-insensitive regex fail on non-ASCII strings.
type: behavior
versions: Python 3.3




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


--
nosy: +haypo




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


--
nosy: +serhiy.storchaka




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread Ezio Melotti

Ezio Melotti added the comment:

It works on 2.7 too, and fails on 3.3/3.x.
Maybe it's related to PEP 393?
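
For context on why a single character such as `'Ā'` could change the behaviour: under PEP 393 a string is stored with 1, 2 or 4 bytes per character, chosen from its highest code point. The sketch below only models that rule; `pep393_charsize` is a hypothetical helper, not an API of CPython or the `re` module.

```python
def pep393_charsize(text):
    """Bytes per character that PEP 393 storage would use for `text`."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:        # fits in Latin-1
        return 1
    if highest < 0x10000:      # fits in UCS-2
        return 2
    return 4                   # needs UCS-4


print(pep393_charsize('aa bcd'))   # 1
print(pep393_charsize('aa Ā'))     # 2
```

Any code that still steps through the buffer one byte at a time keeps working for the first string and goes wrong for the second.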

--
versions: +Python 3.4




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread Arfrever Frehtes Taifersar Arahesis

Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com:


--
nosy: +Arfrever




[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.

2012-12-14 Thread Matthew Barnett

Matthew Barnett added the comment:

In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
            RETURN_FAILURE;
        p += state->charsize;
        ctx->ptr += state->charsize;
    }

However, the code for SRE_OP_GROUPREF_IGNORE (line 1316) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            state->lower(SRE_CHARGET(state, ctx->ptr, 0)) != state->lower(*p))
            RETURN_FAILURE;
        p++;
        ctx->ptr += state->charsize;
    }

(In both cases 'p' is of type 'char*'.)

The problem appears to be that the latter is still using '*p' and 'p++' and is 
thus always working with chars (it gets and advances 1 byte at a time instead 
of 1, 2 or 4 bytes for Unicode).
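
The difference between the two loops can be modelled in Python roughly as follows. This is a sketch under stated assumptions, not the C code: `encode`, `charget`, `lower` and `backref_matches` are hypothetical helpers, the buffer is little-endian for determinism, and `lower` only folds ASCII letters.

```python
import struct

FORMATS = {1: "B", 2: "H", 4: "I"}


def encode(text, charsize):
    """Pack code points into fixed-width little-endian units."""
    fmt = "<%d%s" % (len(text), FORMATS[charsize])
    return struct.pack(fmt, *map(ord, text))


def charget(buf, offset, charsize):
    """Read one whole character at a byte offset (stand-in for SRE_CHARGET)."""
    return struct.unpack_from("<" + FORMATS[charsize], buf, offset)[0]


def lower(cp):
    """ASCII-only case folding, enough for this illustration."""
    return cp + 32 if 0x41 <= cp <= 0x5A else cp


def backref_matches(buf, charsize, group, pos, buggy):
    """Compare the captured group (byte offsets p..e) with the text at pos."""
    p, e = group
    end = len(buf)
    while p < e:
        if pos >= end:
            return False
        # buggy: read one byte (*p); otherwise read a whole character
        captured = buf[p] if buggy else charget(buf, p, charsize)
        if lower(charget(buf, pos, charsize)) != lower(captured):
            return False
        p += 1 if buggy else charsize   # buggy: p++ advances a single byte
        pos += charsize
    return True


narrow = encode("aA", 1)            # 1 byte per character
wide = encode("aA\u0100", 2)        # 'Ā' forces 2 bytes per character
print(backref_matches(narrow, 1, (0, 1), 1, buggy=True))    # True
print(backref_matches(wide, 2, (0, 2), 2, buggy=True))      # False
print(backref_matches(wide, 2, (0, 2), 2, buggy=False))     # True
```

In the 2-byte buffer the byte-wise loop compares the second byte of the captured `'a'` with the following character, so the backreference fails even though the text matches.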

--
