[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Georg Brandl added the comment: I think you will, Matthew being MRAB on the mailing lists :) -- nosy: +georg.brandl ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue16688 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Changes by Serhiy Storchaka storch...@gmail.com: -- assignee: -> serhiy.storchaka
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Roundup Robot added the comment:

New changeset 44a4f9289faa by Serhiy Storchaka in branch '3.3':
Issue #16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
http://hg.python.org/cpython/rev/44a4f9289faa

New changeset c59ee1ff6f27 by Serhiy Storchaka in branch 'default':
Issue #16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
http://hg.python.org/cpython/rev/c59ee1ff6f27

-- nosy: +python-dev
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Serhiy Storchaka added the comment: Fixed. Thank you for the patch, Matthew. I hope to see more of your patches. -- resolution: -> fixed stage: commit review -> committed/rejected status: open -> closed
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Serhiy Storchaka added the comment: The patches LGTM. How about adding a test?
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Matthew Barnett added the comment: Here are some tests for the issue. -- Added file: http://bugs.python.org/file28330/issue16688#3.patch
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Serhiy Storchaka added the comment: The second test passes on unpatched Python.
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Matthew Barnett added the comment: Oops! :-( Now corrected. -- Added file: http://bugs.python.org/file28332/issue16688#3.patch
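The attached test patch is not reproduced in this thread. As a rough sketch only, a regression test built from the two examples reported in the thread might look like the following. The class and method names are hypothetical, and the expected values assume an interpreter in which both problems are fixed.

```python
import re
import unittest


class Issue16688Tests(unittest.TestCase):
    """Hypothetical regression tests based on the examples in this thread."""

    def test_ignorecase_backreference_with_wide_string(self):
        # The subject contains U+0100, so it is stored with more than
        # one byte per character.
        self.assertEqual(re.findall(r"(?i)(a)\1", "aa \u0100"), ["a"])

    def test_bounded_repeat_stops_at_end_of_string(self):
        # Only two NUL characters follow U+0100, so the match must end
        # at index 3 even though the repeat allows up to three.
        match = re.search(r"\x00{1,3}", "\u0100\x00\x00")
        self.assertEqual(match.span(), (1, 3))
```

Saved as a module, this can be run with `python -m unittest`.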
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Changes by Matthew Barnett pyt...@mrabarnett.plus.com: Removed file: http://bugs.python.org/file28330/issue16688#3.patch
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Serhiy Storchaka added the comment: LGTM. Matthew, can you please submit a contributor form? http://python.org/psf/contrib/contrib-form/ http://python.org/psf/contrib/ -- stage: patch review -> commit review
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Serhiy Storchaka added the comment: Good analysis, Matthew. Do you want to submit a patch? -- keywords: +easy stage: -> needs patch
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Matthew Barnett added the comment: OK, here's a patch. -- keywords: +patch Added file: http://bugs.python.org/file28321/issue16688.patch
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Changes by Serhiy Storchaka storch...@gmail.com: -- stage: needs patch -> patch review
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
STINNER Victor added the comment: Can someone check that there are no other similar regressions (introduced by PEP 393)?

2012/12/15 Serhiy Storchaka rep...@bugs.python.org:
> Changes by Serhiy Storchaka storch...@gmail.com:
> -- stage: needs patch -> patch review
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Matthew Barnett added the comment:

I found another bug while looking through the source.

On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;

where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the length in _bytes_, not characters.

If the byte after the end of the string is 0 then you get this:

    >>> # Good:
    >>> re.search(r"\x00{1,3}", "a\x00\x00").span()
    (1, 3)
    >>> # Bad:
    >>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
    (1, 4)

I'll keep looking before submitting a patch.
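The byte-versus-character mix-up described above can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not the C code: the function names are hypothetical, offsets stand in for the `char*` pointers, and the second function shows one plausible correction (comparing against a character count), which is an assumption rather than the contents of the attached patch.

```python
NO_UPPER_BOUND = 65535  # the special maxcount value from the quoted C line


def clamp_end_bytes(ptr, end, maxcount, charsize):
    """Model of the quoted check: maxcount is compared with a byte length."""
    if maxcount < end - ptr and maxcount != NO_UPPER_BOUND:
        end = ptr + maxcount * charsize
    return end


def clamp_end_chars(ptr, end, maxcount, charsize):
    """Hypothetical variant: maxcount is compared with a character count."""
    if maxcount < (end - ptr) // charsize and maxcount != NO_UPPER_BOUND:
        end = ptr + maxcount * charsize
    return end


# "\u0100\x00\x00" holds 3 characters of 2 bytes each; the repeat
# \x00{1,3} starts at character index 1.
charsize = 2
buffer_end = 3 * charsize  # byte offset just past the last character
ptr = 1 * charsize         # byte offset of character index 1

bytes_end = clamp_end_bytes(ptr, buffer_end, 3, charsize)
chars_end = clamp_end_chars(ptr, buffer_end, 3, charsize)
print("byte-length check moves end to offset", bytes_end)
print("character-count check leaves end at offset", chars_end)
```

With the byte-length comparison, the computed end lands one character past the buffer, which is consistent with the reported span ending at index 4 when the following bytes happen to be zero.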
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Matthew Barnett added the comment: I haven't found any other issues, so here's the second patch. -- Added file: http://bugs.python.org/file28325/issue16688#2.patch
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
New submission from pyos:

The title says it all: if a regular expression that makes use of backreferences is compiled with the `re.I` flag, it will always fail when matched against a string that contains characters outside of the U+0000-U+00FF range. I've been unable to further narrow the bug down.

A simple example:

    >>> import re
    >>> r = re.compile(r'(a)\1', re.I)  # should match aa, aA, Aa, or AA
    >>> r.findall('aa')  # works as expected
    ['a']
    >>> r.findall('aa bcd')  # still works
    ['a']
    >>> r.findall('aa Ā')  # ord('Ā') == 0x0100
    []

The same code works as expected in Python 3.2:

    >>> r.findall('aa Ā')
    ['a']

--
components: Regular Expressions
messages: 177518
nosy: ezio.melotti, mrabarnett, pitrou, pyos
priority: normal
severity: normal
status: open
title: Backreferences make case-insensitive regex fail on non-ASCII strings.
type: behavior
versions: Python 3.3
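For checking a particular interpreter, the report above can be condensed into a small script. This is a sketch that reuses the reporter's pattern and subjects; the variable names and the "ok"/"AFFECTED" labels are illustrative only.

```python
import re
import sys

# Same pattern as in the report: a group followed by a backreference,
# compiled case-insensitively.
pattern = re.compile(r"(a)\1", re.I)

# The last subject contains U+0100, which is outside the one-byte range.
subjects = ["aa", "aa bcd", "aa \u0100"]
results = {subject: pattern.findall(subject) for subject in subjects}

for subject in subjects:
    found = results[subject]
    status = "ok" if found == ["a"] else "AFFECTED"
    print("{}.{} {!r}: {} ({})".format(
        sys.version_info[0], sys.version_info[1], subject, found, status))
```

On an interpreter without the bug, every subject should report `['a']`.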
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Changes by STINNER Victor victor.stin...@gmail.com: -- nosy: +haypo
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Changes by STINNER Victor victor.stin...@gmail.com: -- nosy: +serhiy.storchaka
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Ezio Melotti added the comment: It works on 2.7 too, and fails on 3.3/3.x. Maybe it's related to PEP 393? -- versions: +Python 3.4
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com: -- nosy: +Arfrever
[issue16688] Backreferences make case-insensitive regex fail on non-ASCII strings.
Matthew Barnett added the comment:

In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
            RETURN_FAILURE;
        p += state->charsize;
        ctx->ptr += state->charsize;
    }

However, the code for SRE_OP_GROUPREF_IGNORE (line 1316) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            state->lower(SRE_CHARGET(state, ctx->ptr, 0)) != state->lower(*p))
            RETURN_FAILURE;
        p++;
        ctx->ptr += state->charsize;
    }

(In both cases 'p' is of type 'char*'.)

The problem appears to be that the latter is still using '*p' and 'p++' and is thus always working with chars (it gets and advances 1 byte at a time instead of 1, 2 or 4 bytes for Unicode).
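The effect of stepping the group pointer one byte at a time can be illustrated with a small Python model of the two loops. This is a sketch under stated assumptions, not the C implementation: the helper names are hypothetical, `lower` is a stand-in for `state->lower`, and the wide buffer is modelled as little-endian UTF-16 to imitate 2-byte character storage.

```python
def read_char(buf, offset, charsize):
    """Read one character of `charsize` bytes at a byte offset."""
    return int.from_bytes(buf[offset:offset + charsize], "little")


def lower(code):
    """Stand-in for state->lower: lowercase a single code point."""
    lowered = chr(code).lower()
    return ord(lowered) if len(lowered) == 1 else code


def groupref_ignore(buf, p, e, ptr, end, charsize, bytewise):
    """Compare the group buf[p:e] with the text starting at ptr.

    bytewise=True models '*p' and 'p++'; bytewise=False models reading
    and advancing by charsize, as the SRE_OP_GROUPREF loop does.
    """
    while p < e:
        if ptr >= end:
            return False
        ref = buf[p] if bytewise else read_char(buf, p, charsize)
        if lower(read_char(buf, ptr, charsize)) != lower(ref):
            return False
        p += 1 if bytewise else charsize
        ptr += charsize
    return True


def backref_matches(text, charsize, bytewise):
    """Treat the first character as group 1 and test a backreference after it."""
    encoding = {1: "latin-1", 2: "utf-16-le"}[charsize]
    buf = text.encode(encoding)
    return groupref_ignore(buf, 0, charsize, charsize, len(buf),
                           charsize, bytewise)


narrow = backref_matches("aA bcd", 1, bytewise=True)
wide_bytewise = backref_matches("aA \u0100", 2, bytewise=True)
wide_charwise = backref_matches("aA \u0100", 2, bytewise=False)
print(narrow, wide_bytewise, wide_charwise)
```

In this model the byte-wise loop agrees with the character-wise one only for 1-byte storage. With 2-byte storage, its second iteration compares the high byte of the group's first character with the next character of the subject, so the match fails.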