[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Tom Christiansen Thu, 11 Aug 2011 12:04:07 -0700

New submission from Tom Christiansen <[email protected]>:

Python is in flagrant violation of the most basic premises of Unicode 
Technical Report #18 on Regular Expressions, which requires that a regex engine 
support Unicode characters as "basic logical units independent of serialization 
like UTF‑*".  On a narrow build you must specify ".." to match a single Unicode 
character whenever its code point lies above the BMP, so Python regexes cannot 
be reliably used for Unicode text.


 % python3.2
 Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
 [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import re
 >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
 >>> print(g)
 ᾲ
 >>> print(re.search(r'\w', g))
 <_sre.SRE_Match object at 0x10051f988>
 >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
 >>> print(p)
 𝒫
 >>> print(re.search(r'\w', p))
 None
 >>> print(re.search(r'..', p))   # ← 𝙏𝙃𝙄𝙎 𝙄𝙎 𝙏𝙃𝙀 𝙑𝙄𝙊𝙇𝘼𝙏𝙄𝙊𝙉 𝙍𝙄𝙂𝙃𝙏 𝙃𝙀𝙍𝙀
 <_sre.SRE_Match object at 0x10051f988>
 >>> print(len(chr(0x1D4AB)))
 2

That is illegal in Unicode regular expressions: one code point must match as 
one logical unit, however it happens to be stored.
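
To make the length of 2 concrete, here is a minimal sketch of how a narrow 
build sees these strings as UTF-16 code units. The helper names 
utf16_code_units and count_code_points are hypothetical, chosen only for 
illustration. The sketch assumes Python 3.3 or later, where strings are no 
longer stored as UTF-16, so the code units are recovered by encoding to 
UTF-16 rather than read from the string directly.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units that a narrow build would store for text."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count code points in a list of UTF-16 code units.

    A high surrogate followed by a low surrogate counts as one code point;
    every other unit (including an unpaired surrogate) counts as one.
    """
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


g_units = utf16_code_units("\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}")
p_units = utf16_code_units("\N{MATHEMATICAL SCRIPT CAPITAL P}")

print([hex(u) for u in p_units])                  # ['0xd835', '0xdcab']
print(len(g_units), count_code_points(g_units))   # 1 1
print(len(p_units), count_code_points(p_units))   # 2 1
```

The Greek letter sits inside the BMP and occupies one code unit, while U+1D4AB 
is stored as a surrogate pair. An engine that matches per code unit therefore 
needs ".." for it, whereas one that matches per code point would need only ".".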

----------
components: Regular Expressions
messages: 141917
nosy: tchrist
priority: normal
severity: normal
status: open
title: Python lib re cannot handle Unicode properly due to narrow/wide bug
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________