New submission from Tom Christiansen <tchr...@perl.com>:

On neither narrow nor wide builds does this UTF8-encoded bit run without 
raising an exception: 

   if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE): 
       print("match 1 passed")
   else:
       print("match 1 failed")

The best you can possibly do is to use both a wide build *and* symbolic 
literals, in which case it will pass. But remove either or both of those 
conditions and you fail.  This is too restrictive for full Unicode use. 
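
As a minimal sketch of the symbolic-literal form, assuming "symbolic 
literals" means \N{...} named escapes (the names LO, HI and PROBE are 
illustrative, not taken from bigrange.py). On current CPython 3 releases, 
which no longer distinguish narrow from wide builds, this is expected to 
match:

```python
import re

# Endpoints and probe written as named escapes rather than raw UTF-8 text.
LO = "\N{MATHEMATICAL SCRIPT CAPITAL A}"     # U+1D49C
HI = "\N{MATHEMATICAL SCRIPT CAPITAL Z}"     # U+1D4B5
PROBE = "\N{MATHEMATICAL SCRIPT CAPITAL C}"  # U+1D49E

# Build the character class [LO-HI] and test the probe against it.
pattern = "[" + LO + "-" + HI + "]"
matched = re.search(pattern, PROBE, re.UNICODE) is not None
print("match passed" if matched else "match failed")
```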

There should never be any situation where [a-z] fails to match c when a < c < z, 
and neither a nor z is something special in a character class.  There is, or 
perhaps should be, no difference at all between "[a-z]" and "[𝒜-𝒵]", just as 
there is, or at least should be, no difference between "c" and "𝒞". You can’t 
have second-class citizens like this that can't be used.
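
The invariant can be written down as a small check. This is only a sketch: 
range_matches is a hypothetical helper, and the second case assumes the 
range endpoints are U+1D49C and U+1D4B5 with U+1D49E as the probe. It 
asserts the code point ordering lo < ch < hi and then asks the re module 
whether the class [lo-hi] agrees, for one ASCII and one non-BMP case:

```python
import re

def range_matches(lo, hi, ch):
    """Return True if the character class [lo-hi] matches the single character ch."""
    pattern = "[" + re.escape(lo) + "-" + re.escape(hi) + "]"
    return re.fullmatch(pattern, ch) is not None

cases = [
    ("a", "z", "c"),                                # ASCII range
    ("\U0001D49C", "\U0001D4B5", "\U0001D49E"),     # non-BMP range
]
for lo, hi, ch in cases:
    # The premise of the invariant: ch lies strictly between the endpoints.
    assert ord(lo) < ord(ch) < ord(hi)
    print(hex(ord(ch)), range_matches(lo, hi, ch))
```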

And no, this one is *not* fixed by Matthew Barnett's regex library. There is 
some dumb UCS-2 assumption lurking deep in Python somewhere that makes this 
break, even on wide builds, which is incomprehensible to me.
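
To illustrate what a 16-bit assumption would do to this pattern, the sketch 
below splits each character into UTF-16 code units (utf16_units is a 
hypothetical helper). Read unit by unit, the class stops being one range and 
becomes a lone high surrogate, a range from a low surrogate to a high 
surrogate, and a lone low surrogate. This is one plausible reading of the 
narrow-build failure only; it does not account for a failure on wide builds.

```python
def utf16_units(ch):
    """Return the UTF-16 code units of ch as a list of integers."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

lo_units = utf16_units("\U0001D49C")   # range start
hi_units = utf16_units("\U0001D4B5")   # range end
print([hex(u) for u in lo_units], [hex(u) for u in hi_units])

# Seen as 16-bit units, the "-" sits between the low surrogate of the start
# and the high surrogate of the end, so the range runs backwards.
range_start, range_end = lo_units[1], hi_units[0]
print("range is reversed:", range_start > range_end)
```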

----------
components: Regular Expressions
files: bigrange.py
messages: 142058
nosy: Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, 
tchrist, terry.reedy
priority: normal
severity: normal
status: open
title: lib re cannot match non-BMP ranges (all versions, all builds)
type: behavior
versions: Python 3.2
Added file: http://bugs.python.org/file22897/bigrange.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12749>
_______________________________________