[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Ezio Melotti Sun, 14 Aug 2011 21:57:01 -0700

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

Keep in mind that we should be able to access and use lone surrogates too, 
therefore:
s = '\ud800'  # should be valid
len(s)  # should this raise an error? (or return 0.5 ;)?
s[0]  # error here too?
list(s)  # here too?


p = s + '\udc00'
len(p)  # 1?
p[0]  # '\U00010000' ?
p[1]  # IndexError?
list(p + 'a')  # ['\ud800\udc00', 'a']?
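
For comparison, here is a minimal sketch of how these questions come out on an 
interpreter where every str element is a single code point. That is an 
assumption on my part (CPython 3.3 or later, which postdates this message). 
The variable names other than s and p are made up for illustration:

```python
s = '\ud800'            # a lone high surrogate is accepted as a str literal
p = s + '\udc00'        # high surrogate followed by a low surrogate

lone_len = len(s)       # no error: the lone surrogate counts as one element
lone_first = s[0]       # indexing returns the surrogate itself
lone_items = list(s)    # iteration yields it unchanged

pair_first = p[0]       # still the high surrogate, not a merged code point
pair_items = list(p + 'a')  # the two halves stay separate elements

# repr() escapes the surrogates, so printing the list does not hit an
# encoding error on the output stream.
print(lone_len, lone_items, pair_items)
```

Under that assumption none of the operations raise, and concatenating the two 
halves does not produce '\U00010000'.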

We can still decide that strings with lone surrogates work only with a limited 
number of methods/functions but:
1) it's not backward compatible;
2) it's not very consistent.

Another thing I noticed is that (at least on wide builds) surrogate pairs are 
not joined "on the fly":
>>> p
'\ud800\udc00'
>>> len(p)
2
>>> p.encode('utf-16').decode('utf-16')
'𐀀'
>>> len(_)
1
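
If joining is wanted, it has to be done explicitly. The sketch below shows one 
way to do it without going through a codec. join_surrogates is a hypothetical 
helper written for this illustration, not an existing function, and it assumes 
a build where chr() can return any code point up to U+10FFFF (wide builds, or 
CPython 3.3 and later). Lone surrogates are deliberately left untouched:

```python
def join_surrogates(text):
    """Merge each high+low surrogate pair in text into one code point.

    Hypothetical helper for illustration; lone surrogates pass through as-is.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ''
        is_high = '\ud800' <= ch <= '\udbff'
        is_low = bool(nxt) and '\udc00' <= nxt <= '\udfff'
        if is_high and is_low:
            # Standard UTF-16 arithmetic: 10 bits from each half, offset 0x10000.
            hi = ord(ch) - 0xD800
            lo = ord(nxt) - 0xDC00
            out.append(chr(0x10000 + (hi << 10) + lo))
            i += 2
        else:
            out.append(ch)
            i += 1
    return ''.join(out)


p = '\ud800' + '\udc00'
joined = join_surrogates(p)          # one code point, U+10000
lone = join_surrogates('\ud800')     # unchanged lone surrogate
mixed = list(join_surrogates(p + 'a'))
```

Keeping lone surrogates as they are avoids the backward-compatibility problem 
mentioned above, at the cost of making the joining an explicit step.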

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________