Vlastimil Brom <vlastimil.b...@gmail.com> added the comment:

I like the idea of a general "new" flag that introduces the reasonable, backwards-incompatible behaviour; one doesn't have to remember a list of non-standard flags to get these features.

While I recognise that the module probably can't work correctly with wide unicode characters on a narrow Python build (py 2.7, win XP in this case), I noticed a difference from re in this regard (it might be due to the absence of the wide unicode literal in the latter).

re.findall(u"\\U00010337", u"a\U00010337bc")
[]
re.findall(u"(?i)\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"(?i)\\U00010337", u"a\U00010337bc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 203, in findall
    return _compile(pattern, flags).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 310, in _compile
    parsed = parsed.optimise(info)
  File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
    if self.is_case_sensitive(info):
  File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
    return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

I.e. re fails to match this pattern (as it actually looks for "U00010337"); regex doesn't recognise the wide unicode character as a surrogate pair either, but it also raises an error from the narrow unichr. I'm not sure whether or how it should be fixed, but a difference that depends only on the i-flag seems unusual. Of course it would be nice if surrogate pairs were interpreted, but I can imagine that it would open a whole can of worms, as this is not thoroughly supported in the builtin unicode either (len, indices, slicing).
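
For illustration only, here is a minimal sketch of the surrogate-pair arithmetic that a narrow build applies to a character such as U+10337. The helper names to_surrogates and from_surrogates are hypothetical and are not part of re or regex; the code unit count is derived from a UTF-16 encoding, so the sketch also runs on builds that are not narrow.

```python
# Hypothetical helpers for illustration; not part of re or regex.

def to_surrogates(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


high, low = to_surrogates(0x10337)
print(hex(high))                        # 0xd800
print(hex(low))                         # 0xdf37
print(hex(from_surrogates(high, low)))  # 0x10337

# Number of UTF-16 code units in the test string: the surrogate pair
# counts as two units, which is what len() reports on a narrow build.
units = len(u"a\U00010337bc".encode("utf-16-le")) // 2
print(units)                            # 5
```

The string has four characters but five code units, which is why len, indexing and slicing on a narrow build do not line up with characters.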

I am trying to make wide unicode characters somehow usable in my app, mainly with hacks like an extended unichr

("\U"+hex(67)[2:].zfill(8)).decode("unicode-escape")

or likewise for ord

surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000

Actually, using regex, one can work around some of these limitations of len, index or slice by using a list form of the string containing surrogates

regex.findall(ur"(?s)(?:\p{inHighSurrogates}\p{inLowSurrogates})|.", u"ab𐌷𐌸𐌹cd")
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']

but apparently things like wide unicode literals or character sets (even extending the shorthands like \w etc.) are much more complicated.

regards,
vbr

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________