Ezio Melotti <[email protected]> added the comment:
Keep in mind that we should be able to access and use lone surrogates too,
therefore:
s = '\ud800' # should be valid
len(s) # should this raise an error? (or return 0.5 ;)?
s[0] # error here too?
list(s) # here too?
p = s + '\udc00'
len(p) # 1?
p[0] # '\U00010000' ?
p[1] # IndexError?
list(p + 'a') # ['\ud800\udc00', 'a']?
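To make the questions above concrete, here is a minimal sketch of one possible answer: iteration joins a high surrogate that is immediately followed by a low surrogate, and leaves lone surrogates untouched. The helper name `code_points` is hypothetical and not part of the str API, and the sketch assumes Python 3.3 or later, where `chr()` returns a single character for non-BMP code points.

```python
def code_points(s):
    """Yield the items of s, joining each high+low surrogate pair.

    Hypothetical helper for illustration only; lone surrogates are
    passed through unchanged instead of raising an error.
    """
    i = 0
    n = len(s)
    while i < n:
        ch = s[i]
        is_high = '\ud800' <= ch <= '\udbff'
        if is_high and i + 1 < n and '\udc00' <= s[i + 1] <= '\udfff':
            # Combine the pair using the standard UTF-16 formula.
            high = ord(ch) - 0xD800
            low = ord(s[i + 1]) - 0xDC00
            yield chr(0x10000 + (high << 10) + low)
            i += 2
        else:
            # BMP character or lone surrogate: yield as-is.
            yield ch
            i += 1


s = '\ud800'
p = s + '\udc00'
print(len(list(code_points(s))))        # lone surrogate counts as one item
print(len(list(code_points(p + 'a'))))  # joined pair plus 'a'
```

Under this choice, a lone surrogate still has length 1 and indexing never fails, at the cost of an O(n) scan for every length or index operation.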
We could still decide that strings with lone surrogates work with only a limited
number of methods/functions, but:
1) it's not backward compatible;
2) it's not very consistent.
Another thing I noticed is that (at least on wide builds) surrogate pairs are
not joined "on the fly":
>>> p
'\ud800\udc00'
>>> len(p)
2
>>> p.encode('utf-16').decode('utf-16')
'𐀀'
>>> len(_)
1
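For anyone reproducing the session above on a newer interpreter, here is a hedged sketch assuming CPython 3.4 or later. There the strict UTF-16 encoder refuses surrogate code points, so the round trip needs the `surrogatepass` error handler on the encoding side. The variable names are only illustrative.

```python
s = '\ud800'
p = s + '\udc00'

# Concatenation does not join the two surrogates into one code point.
assert len(p) == 2

# The strict codec rejects surrogates on the assumed interpreter version.
try:
    p.encode('utf-16')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' lets the surrogates through as UTF-16 code units;
# decoding then sees a valid pair and yields a single code point.
joined = p.encode('utf-16', 'surrogatepass').decode('utf-16')

print(strict_ok)
print(len(joined), joined == '\U00010000')
```

The pair is joined only by the codec round trip, which matches the observation that nothing joins it "on the fly".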
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________