Ezio Melotti <ezio.melo...@gmail.com> added the comment:

Keep in mind that we should be able to access and use lone surrogates too, therefore:

s = '\ud800'  # should be valid
len(s)  # should this raise an error? (or return 0.5 ;)?
s[0]  # error here too?
list(s)  # here too?
p = s + '\udc00'
len(p)  # 1?
p[0]  # '\U00010000'?
p[1]  # IndexError?
list(p + 'a')  # ['\ud800\udc00', 'a']?

We can still decide that strings with lone surrogates work only with a limited number of methods/functions, but:
1) it's not backward compatible;
2) it's not very consistent.

Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":

>>> p
'\ud800\udc00'
>>> len(p)
2
>>> p.encode('utf-16').decode('utf-16')
'𐀀'
>>> len(_)
1

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________
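For illustration only, and not part of the original comment: a minimal sketch of what joining a surrogate pair "on the fly" would involve. It assumes Python 3.4 or later, where the UTF-16 codec accepts the `surrogatepass` error handler (a plain `encode('utf-16')` rejects surrogates on those versions). `join_surrogates` is a hypothetical helper name, not an existing API.

```python
def join_surrogates(text):
    """Hypothetical helper: merge each high+low surrogate pair into one code point.

    Lone surrogates are passed through unchanged.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        is_pair = (
            "\ud800" <= ch <= "\udbff"
            and i + 1 < len(text)
            and "\udc00" <= text[i + 1] <= "\udfff"
        )
        if is_pair:
            hi = ord(ch) - 0xD800
            lo = ord(text[i + 1]) - 0xDC00
            # Standard UTF-16 surrogate arithmetic.
            out.append(chr(0x10000 + (hi << 10) + lo))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


s = "\ud800"              # lone high surrogate
p = s + "\udc00"          # two separate surrogate code points
joined = join_surrogates(p)
# Round trip through UTF-16, as in the session above, but with surrogatepass.
roundtrip = p.encode("utf-16", "surrogatepass").decode("utf-16")
```

Under these assumptions, `p` keeps length 2, while both `joined` and `roundtrip` are the single character U+10000; a lone surrogate such as `s` comes back from the helper unchanged.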