Serhiy Storchaka added the comment:

I prefer a slightly different form (simpler for me):

for (p = collstart; p < collend;) {
    Py_UCS4 ch = *p++;
    if ((0xD800 <= ch && ch <= 0xDBFF) &&
        (p < collend) &&
        (0xDC00 <= *p && *p <= 0xDFFF)) {
        ch = ((((ch & 0x03FF) << 10) |
               ((Py_UCS4)*p++ & 0x03FF)) + 0x10000);
    }
    str += sprintf(str, "&#%d;", (int)ch);
}

Please also look at the loop above ("determine replacement size"); it needs the same correction. It would be simpler to use a buffer with a static size (``char buffer[2+29+1+1];``), as the charmap encoder does. Perhaps the charmap encoder should be fixed too (and the common code extracted into a separate function).

I have doubts about '\ud83d\udc9d' on a wide build. Is it right to encode it as b'&#128157;' and not as b'&#55357;&#56477;'? That would be compatible with the narrow build but would break compatibility with 3.3+. Which is the lesser evil?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue15866>
_______________________________________
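
For anyone who wants to try the surrogate-joining loop from the comment above outside of CPython, here is a minimal standalone sketch. It is not the patch itself: `xmlcharref_sketch` is a hypothetical name, `uint16_t`/`uint32_t` stand in for narrow-build code units and `Py_UCS4`, and the output-capacity check is an assumption added so the example is safe to run on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not CPython API): writes "&#N;" references for the
   UTF-16 code units in [p, end) into out, joining valid surrogate pairs
   into one code point. Returns the number of characters written, not
   counting the terminating NUL. Stops early if out would overflow. */
static size_t
xmlcharref_sketch(const uint16_t *p, const uint16_t *end,
                  char *out, size_t cap)
{
    size_t used = 0;

    if (cap > 0)
        out[0] = '\0';
    while (p < end) {
        /* "&#" + up to 7 digits (max 1114111) + ";" + NUL */
        char buf[2 + 7 + 1 + 1];
        int n;
        uint32_t ch = *p++;

        /* High surrogate followed by a low surrogate: combine them. */
        if (0xD800 <= ch && ch <= 0xDBFF && p < end &&
            0xDC00 <= *p && *p <= 0xDFFF) {
            ch = (((ch & 0x03FF) << 10) |
                  ((uint32_t)*p++ & 0x03FF)) + 0x10000;
        }
        n = snprintf(buf, sizeof buf, "&#%lu;", (unsigned long)ch);
        assert(n > 0 && (size_t)n < sizeof buf);
        if (used + (size_t)n + 1 > cap)
            break;  /* not enough room left in out */
        memcpy(out + used, buf, (size_t)n + 1);
        used += (size_t)n;
    }
    return used;
}
```

With the input `{0xD83D, 0xDC9D}` this produces a single reference for U+1F49D, while a high surrogate that is not followed by a low surrogate is emitted as its own reference, which is the behaviour the question about wide builds turns on.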